The Simplest ‘‘Crossover’’ or ‘‘Switchover’’ Design





MANY psychological studies deal with the 
comparability of mean scores on various tests, 
such as IQ’s for the Revised Stanford-Binet and 
the Wechsler Intelligence Scale for Children 
(WISC). If a random half of the available sub- 
jects are given the S-B and the other half the 
WISC under carefully equated conditions, a t- 
test for the significance of the difference between 
the uncorrelated means provides an answer to 
the question posed. 

However, the use of independent groups may 
be quite inefficient, since the correlation between 
test scores for the same individuals will often 
be high. Testing only one group of children with 
both instruments and then using analysis of var- 
iance significance tests for correlated observa- 
tions should be more efficient. This method 
causes some difficulties in design and analysis 
not encountered with the independent-group technique,
but these are not serious enough to disqualify it.

The simplest design of this sort occurs when 
only two tests are being compared. These two 
tests should be administered in counter-balanc- 
ed order to random halves of the total group. 
Half of the subjects should be tested first with, 
say, the S-B and second with the WISC; the other 
half should have the WISC followed by the S-B. 
In this way sequence effects will not be confound- 
ed with the difference in difficulty of the two tests. 
Taking a subject at random from the S-B, WISC 
group and one at random from the WISC, S-B 
group results in the simple latin square of Table 
I. There will be half as many such latin squares 
as there are subjects in the investigation, or as 
many as there are individuals in either group. 

It is more convenient to set up the scores for 
analysis in the form of Table II, where the two 
orders are kept separate. This is a consolida- 
tion of n latin squares like the one shown in 
Table I. 
STATISTICAL ANALYSIS OF SCORES FROM
COUNTERBALANCED TESTS


JULIAN C. STANLEY¹ *

*Footnotes will be found at the end of the article, as well as the bibliography.





Number 3 


The Table II scores can be analyzed for or- 
ders, individuals, tests, and sessions (first 
test versus second test) by a method set forth 
in considerable detail by Lindquist (17:273-81), 
Edwards (3:319-27; 4), Walker and Lev (24: 
373-81), and Grant (9:428-32;10). Apparently 
the design has been applied to this type of prob- 
lem only by Gellerman (7), who extended it by 
matching his two order groups individual by
individual (1 with n + 1, 2 with n + 2, . . . , n
with 2n) on other variables but failed to take
this additional source of correlation into account
when analyzing the scores.

There are the following degrees of freedom
for sums of squares from Table II: 1 each for
tests, orders, and sessions; 2(n - 1) for differences
between individuals of the same order;
and 2[(n - 1)(2 - 1)] for the ‘‘residual within
individuals,’’ which is the individuals of the
same order by tests interaction pooled for the
two orders. These d.f. add to 4n - 1, or N - 1,
the usual sum. The proper ‘‘error’’ term
for orders is the ‘‘differences between individuals
of the same order’’ mean square. For sessions
and tests the ‘‘residual within individuals’’
mean square is the appropriate divisor. The
reader interested in the rationale of the analysis
should consult Cochran (2), Edwards (3, 4),
Grant (9, 10), Kempthorne (13:Ch. 29), and
Lindquist (17).
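
The partition of degrees of freedom just described can be checked mechanically. The following is a minimal sketch, not part of the original analysis; the function name `crossover_df` and the dictionary keys are illustrative labels introduced here.

```python
def crossover_df(n: int) -> dict:
    """Degrees of freedom for a 2 x 2 crossover with n individuals per order."""
    df = {
        "tests": 1,
        "orders": 1,
        "sessions": 1,
        # differences between individuals of the same order, both orders
        "individuals_within_orders": 2 * (n - 1),
        # individuals x tests interaction, pooled over the two orders
        "residual_within_individuals": 2 * (n - 1) * (2 - 1),
    }
    df["total"] = sum(df.values())  # summed before the "total" key exists
    return df


budget = crossover_df(25)
assert budget["total"] == 4 * 25 - 1  # N - 1, with N = 4n scores
print(budget)
```

With n = 25, as in the worked example that follows in the article, the residual carries 48 d.f. and the total is 99.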

Some examples of counterbalanced designs 
involving two tests are the studies of Frandsen 
and Higginson (6), Gerboth (8), Holland (12),
Krugman and others (15), Kutash (16), Pastovic 
and Guthrie (19), Peel (20), and Vanderhost 
and others (23). Hays and Schneider (11) em- 
ployed a more complex counterbalancing pro- 
cedure with 16 independent groups. With the 
single exception of Gellerman, noted above, it 
appears that no test investigator has used the 
general type of design shown in Table II, though 
in at least eight different studies it would probably
have been more appropriate than the statistical
analysis reported. Since no textbook or article makes
thoroughly explicit the procedure for this particular
psychometric design, let us work through some data
based upon the Frandsen-Higginson (6) report, showing
both the correct and incorrect way to analyze it. The
writer is indebted to Professor Arden N. Frandsen for
these scores.


The Frandsen-Higginson Study


Originally, Higginson tested 54 fourth-grade
children, 27 with the S-B first and the WISC second
and the other 27 with the WISC first and the S-B
second. Order and score information





TABLE I

THE ELEMENTAL LATIN SQUARE IN A COUNTERBALANCED TESTING STUDY

                                 Tests
Orders        Individuals        S-B            WISC

S-B, WISC     A                  S-B First      WISC Second
WISC, S-B     B                  S-B Second     WISC First








TABLE II

COUNTERBALANCED TESTING WITH THE S-B AND WISC

Orders                           IQ’s
(Groups)      Individuals        S-B            WISC

S-B, WISC     1                  First test     Second test
              2                  First test     Second test
              .                  .              .
              n                  First test     Second test

WISC, S-B     n+1                Second test    First test
              n+2                Second test    First test
              .                  .              .
              2n                 Second test    First test
for 51 of the 54 testees was furnished the writer;
for three persons the order of testing was
lost at the time of inquiry. By discarding another
testee randomly the writer had 50 individuals
left, making a total of 100 scores, as outlined
in Table II. Table IV shows how analyzing
the orders separately is related to the appropriate
overall analysis, which permits tests of
both orders and sequences, as well as of individuals
(nearly always significant) and tests.
The chief finding is an F of 6.77 for tests,
which with 1 and 48 degrees of freedom is significant
between the 5 and 1 percent levels, IQ’s
being higher on the S-B than on the WISC. If
the 25 scores for the S-B taken first are tested
against the 25 for the WISC taken first (a simple
randomized design), the F(1, 48) = 641/177 = 3.6,
a considerable loss of precision.³
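
The independent-groups ratio of 641/177 can be retraced from summary figures quoted in the article: the first-session sums (2716 for the S-B, 2537 for the WISC) and the two first-session standard deviations given in Footnote 3. The sketch below assumes those standard deviations were computed with divisor n rather than n - 1; that assumption is not stated in the source, and the variable names are illustrative.

```python
# Summary figures for the first-session scores only (independent groups).
n = 25                   # pupils per group
sum_sb_first = 2716      # S-B IQs of the group tested with the S-B first
sum_wisc_first = 2537    # WISC IQs of the group tested with the WISC first
sd_sb_first = 15.45      # S.D.s as reported; divisor n is ASSUMED here
sd_wisc_first = 10.08

# Between-groups sum of squares (1 d.f.) from the two group totals.
ss_between = (sum_sb_first - sum_wisc_first) ** 2 / (2 * n)

# Within-groups sum of squares recovered from the S.D.s, then its mean square.
ss_within = n * (sd_sb_first ** 2 + sd_wisc_first ** 2)
ms_within = ss_within / (2 * (n - 1))

f_independent = ss_between / ms_within
print(round(ss_between), round(ms_within), round(f_independent, 1))
```

Under that assumption the sketch reproduces the rounded figures 641, 177, and 3.6.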

The mean square of 25.00 for sequences in
Table IV is less than the residual error term,
which is hardly surprising because the average
practice effect inferred from Table III means
is -1.00 IQ points. Also, the two order (group)
means, 106.38 and 102.74, do not differ significantly
when considered in terms of variation
among individuals within orders.

The six sums of squares are secured as follows
from the figures in Table III:


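\[ SS_{o} = \frac{1}{100}\left\{ 2\left[(5319)^2 + (5137)^2\right] - (10,456)^2 \right\} = 331.24. \]

\[ SS_{i} = \frac{1}{50}\left\{ 25(2,212,516) - \left[(5319)^2 + (5137)^2\right] \right\} = 12,647.40. \]

\[ SS_{tests} = \frac{1}{100}\left\{ 2\left[(5316)^2 + (5140)^2\right] - (10,456)^2 \right\} = 309.76. \]

\[ SS_{s} = \frac{1}{100}\left\{ 2\left[(5253)^2 + (5203)^2\right] - (10,456)^2 \right\} = 25.00. \]

\[ SS_{r} = \frac{1}{50}\left\{ 50(574,768 + 534,020) - 25(2,212,516) - 2\left[(2716)^2 + (2603)^2 + (2600)^2 + (2537)^2\right] + \left[(5319)^2 + (5137)^2\right] \right\} = 2195.24. \]

\[ SS_{total} = \frac{1}{100}\left[ 100(574,768 + 534,020) - (10,456)^2 \right] = 15,508.64. \]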
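
The six sums of squares can be recomputed from the summary totals alone. The sketch below is illustrative rather than the author's worksheet: the names (`cell`, `ss_two_totals`, and so on) are introduced here, and the residual is obtained by subtraction from the total, which is algebraically equivalent to the direct formula.

```python
# Totals taken from the Frandsen-Higginson summary table (25 pupils per order).
n = 25
cell = {("SB", 1): 2716, ("WISC", 1): 2603,   # order 1: S-B first, WISC second
        ("SB", 2): 2600, ("WISC", 2): 2537}   # order 2: WISC first, S-B second
sum_sq_scores = 574_768 + 534_020             # sum of all 100 squared IQs
sum_sq_pupil_sums = 2_212_516                 # sum of the 50 squared pupil sums

grand = sum(cell.values())
N = 4 * n
cf = grand ** 2 / N                           # correction term


def ss_two_totals(a, b):
    """SS for a two-level factor whose level totals are a and b (2n scores each)."""
    return (a ** 2 + b ** 2) / (2 * n) - cf


order_totals = (cell["SB", 1] + cell["WISC", 1], cell["SB", 2] + cell["WISC", 2])
test_totals = (cell["SB", 1] + cell["SB", 2], cell["WISC", 1] + cell["WISC", 2])
seq_totals = (cell["SB", 1] + cell["WISC", 2], cell["WISC", 1] + cell["SB", 2])

ss = {
    "orders": ss_two_totals(*order_totals),
    "tests": ss_two_totals(*test_totals),
    "sequences": ss_two_totals(*seq_totals),
    "individuals": sum_sq_pupil_sums / 2
                   - sum(t ** 2 for t in order_totals) / (2 * n),
    "total": sum_sq_scores - cf,
}
# Residual by subtraction: total minus the four components above.
ss["residual"] = ss["total"] - sum(v for k, v in ss.items() if k != "total")

f_tests = ss["tests"] / (ss["residual"] / (2 * (n - 1)))
for name, value in ss.items():
    print(f"{name:12s}{value:12.2f}")
print(f"F(tests) = {f_tests:.2f}")
```

Dividing the tests sum of squares by the residual mean square (48 d.f.) gives the F of 6.77 reported for tests.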


As Grant (9:430) has pointed out, when only 
two tests are being compared, each of the three 
factors (tests, orders, and sequences) is com- 
pletely confounded with the interaction of the 
other two. Therefore, the sum of squares for 
‘‘tests’’ is identical with that for the orders x 
sequences interaction: 
\[ \frac{1}{100}(2716 + 2600 - 2603 - 2537)^2 = 309.76. \]


Also, the sum of squares for ‘‘orders’’ is 
the same as the sum of squares for the tests x 
sequences interaction: 


\[ \frac{1}{100}(2716 + 2603 - 2600 - 2537)^2 = 331.24. \]


Likewise, the ‘‘sequences’’ sum of squares 
may be secured via the orders x tests interac- 
tion: 


\[ \frac{1}{100}(2716 + 2537 - 2603 - 2600)^2 = 25.00. \]
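
The three confounding identities amount to three plus-and-minus contrasts among the same four cell totals. A short sketch follows; the helper `contrast_ss` and the variable names are illustrative and not taken from the source.

```python
# Cell totals: (S-B, order 1), (WISC, order 1), (S-B, order 2), (WISC, order 2).
sb1, wisc1, sb2, wisc2 = 2716, 2603, 2600, 2537
N = 100  # total number of scores


def contrast_ss(plus, minus):
    """Single-d.f. sum of squares for a +/- contrast among the four cell totals."""
    return (sum(plus) - sum(minus)) ** 2 / N


# tests = orders x sequences; orders = tests x sequences; sequences = orders x tests
ss_tests = contrast_ss((sb1, sb2), (wisc1, wisc2))
ss_orders = contrast_ss((sb1, wisc1), (sb2, wisc2))
ss_sequences = contrast_ss((sb1, wisc2), (wisc1, sb2))
print(ss_tests, ss_orders, ss_sequences)
```

Each contrast uses every cell total exactly once, which is why only three distinct single-d.f. comparisons exist among four cells.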


A little reflection will convince the reader
that these relationships are quite reasonable.
For example, the orders x sequences interaction
will be significant only if the mean of the
S-B taken first minus the mean of the WISC taken
second differs significantly from the mean of
the WISC taken first minus the mean of the S-B
taken second:

\[ (M_{\text{S-B}_1} - M_{\text{WISC}_2}) - (M_{\text{WISC}_1} - M_{\text{S-B}_2}) \]

may also be written

\[ (M_{\text{S-B}_1} + M_{\text{S-B}_2}) - (M_{\text{WISC}_1} + M_{\text{WISC}_2}), \]

showing that the o x s interaction is equivalent to the
‘‘tests’’ main effect (S-B versus WISC) when only
two tests are compared.

If just the ‘‘tests’’ effect is of interest, as in 
the Frandsen-Higginson study, the mean differ- 
ence between S-B and WISC scores of the 50 
testees may be tested for significant deviation 
from zero somewhat more easily than in Table 
IV. Obtain the 25 differences (say, S-B -WISC) 
for each order and the sum and sum of squares 
for each of the two sets of differences. Then 


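\[
F_{1,48} = \frac{\dfrac{1}{50}\left(\sum_{1}^{50} D\right)^{2}}
{\dfrac{1}{48}\cdot\dfrac{1}{25}\left[25\sum_{1}^{50} D^{2} - \left(\sum_{1}^{25} D\right)^{2} - \left(\sum_{26}^{50} D\right)^{2}\right]}
= \frac{619.52}{91.47} = 6.77.
\]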





This is identical with the F for ‘‘tests’’ in 
Table IV, but note that both numerator and de- 
nominator of the F-ratio are twice as large as 
the analogous mean squares of Table IV. 
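
The difference-score shortcut can be written as a small function. This is a sketch under stated assumptions: the function name is illustrative, and the six differences in the usage line are hypothetical values chosen only to exercise the code, not scores from the study.

```python
def f_from_differences(d_order1, d_order2):
    """F for the tests effect from per-pupil differences (S-B minus WISC).

    The numerator has 1 d.f.; the denominator pools the within-order
    variation of the differences, with n1 + n2 - 2 d.f.
    """
    n1, n2 = len(d_order1), len(d_order2)
    all_d = list(d_order1) + list(d_order2)
    numerator = sum(all_d) ** 2 / (n1 + n2)
    ss_within = (sum(d * d for d in all_d)
                 - sum(d_order1) ** 2 / n1
                 - sum(d_order2) ** 2 / n2)
    denominator = ss_within / (n1 + n2 - 2)
    return numerator / denominator


# Hypothetical differences for three pupils per order (not the article's data).
print(f_from_differences([12, 1, 5], [-7, 3, 9]))

# Numerator check against the article's totals: (5316 - 5140)^2 / 50.
numerator_article = (5316 - 5140) ** 2 / 50
print(numerator_article)
```

Because each difference removes the pupil's own level, only the within-order spread of the differences remains as error.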

The statistical procedure used most often in 
studies employing two counterbalanced tests is 
a t-test for the significance of the difference be- 
TABLE III*

FRANDSEN-HIGGINSON (6) DATA

                                           WISC
Orders                       S-B           Full-Scale     Pupil
(Groups)                     IQ            IQ             Sums

S-B First,                   132           120            252
WISC Second                  121           120            241
                             .             .              .
                             84            82             166

  (Sum)                      2,716         2,603          5,319
  (Squares)                  301,032       274,023        1,146,615
  (Mean)                     108.64        104.12         106.38
  (S.D.)                     15.45         10.95

WISC First,                                               233
S-B Second                   .             .              .
                             91            82             173

  (Sum)                      2,600         2,537          5,137
  (Squares)                  273,736       259,997        1,065,901
  (Mean)                     104.00        101.48         102.74
  (S.D.)                     11.55         10.08

  (Overall Sum)              5,316         5,140          10,456
  (Overall Squares)          574,768       534,020        2,212,516
  (Overall Mean)             106.32        102.80         104.56
  (Overall S.D.)             13.84         10.61

                             Test Taken First    Test Taken Second
  (Sequence Sums)            5,253               5,203

*This table has been abbreviated by omitting the scores of 44 subjects. Also, no scores are shown
for Table V. These two tables and the material referred to in Footnote 5 are in the Appendix at the
end of this article.
tween the correlated means, ignoring both order
and sequence. This amounts to lumping the order
sum of squares with the sum of squares for
individuals and putting the sum of squares for
sequences into the residual term. In the
Frandsen-Higginson investigation such consolidation
makes little difference, since neither orders
nor sequences contribute much variation. Where
the two groups are randomly selected halves of
the total sample utilized in the study, the order
effect will probably be great only if experimental
controls differ considerably for the two groups
or the two tests are quite different in content or
difficulty, so that taking a particular one first
facilitates work on the second test much more
than taking the other test first would do. When
this latter condition holds, the analysis of
variance is vitiated (see 13, Ch. 29). As will be
seen later in this paper (Table V), sequence
(largely practice) effect may be a huge source
of variation.


Estimation of Test Means 





The counterbalanced design of the Frandsen-Higginson
(6) study permits us to estimate that
for this fourth-grade group the S-B IQ’s run
106.32 - 102.80 = 3.52 points higher than WISC
full-scale IQ’s. Since practice effect is negligible,
these two means may be taken to be equivalent
to the means that would have been secured
had all 50 subjects been tested with one or the
other of the tests. When practice effect is
significant, however, the test means should not be
accepted as such estimates without a suitable
correction, although the difference between them
may still be used directly. For instance, Krugman,
Justman, Wrightstone, and Krugman (15:477)
in a counterbalanced study commented that
‘‘At almost every age level, the obtained mean
IQ’s are above 100, indicating that the sample
chosen was slightly above average, although it
was hoped that an average sample would be obtained
by selecting representative schools,’’ but
they did not mention practice effect as a possible
biasing factor.


Estimation of r’s 





Coefficients of correlation between the two 
tests should be based upon within-order sums of 
squares and sums of products if either order dif- 
ferences or sequence effects are appreciable. 
Group differences tend to increase the r when 
computed from the overall scores, while prac- 
tice effect lowers it. In the Frandsen-Higgin- 
son (6) study the various r’s are: .7069 for the 
‘‘S-B first’’ group; .7678 for the ‘‘WISC first’’
group; .7329 overall; and .7285 within groups.⁴
Thus in this instance the two methods make no 
difference whatsoever in the second decimal 
place, but there are probably many situations
where the overall r will be considerably differ- 
ent from the within-groups r. Illustrations of 
this discrepancy will be given in Table VI. If 
X represents any score on one test and Y any 
score on the other, the Frandsen-Higginson 
within-groups r may be computed directly as 
shown at the top of page 9. 
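
A pooled within-groups r takes deviations from each order group's own means before summing squares and products. The sketch below illustrates the idea on hypothetical scores in which one group simply runs higher on both tests; the function name and the data are illustrative. The last lines retrace only the final step of the article's computation from its printed numerator and denominator terms.

```python
from math import sqrt


def within_groups_r(groups):
    """Pooled within-group r; `groups` is a list of (x_scores, y_scores) pairs."""
    sxy = sxx = syy = 0.0
    for x, y in groups:
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n   # each group's own means
        sxy += sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx += sum((a - mx) ** 2 for a in x)
        syy += sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores: the second group runs 20 points higher on both tests.
g1 = ([100, 104, 108, 112], [101, 99, 107, 105])
g2 = ([120, 124, 128, 132], [121, 119, 127, 125])
overall = ([*g1[0], *g2[0]], [*g1[1], *g2[1]])

r_within = within_groups_r([g1, g2])
r_overall = within_groups_r([overall])   # one undivided group = ordinary r
print(round(r_within, 4), round(r_overall, 4))

# Final step of the article's own computation, from its printed terms.
article_r = 130_752 / sqrt(232_544 * 138_522)
print(round(article_r, 4))
```

In this toy case the group difference alone lifts the overall r well above the within-groups value, which is the inflation the text attributes to group differences.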


Estimation of Standard Deviations 





Standard deviations will be affected by dif- 
ferences between groups and by practice effect, 
being increased by both. An unbiased estimate 
of the population variance of the S-B IQ’s may 
be secured by taking the 232,544 figure in the 
denominator of the computation at the top of the
next page and dividing it by 2(24)(25), or 1200.
The divisor in general is 2(n - 1)(n), where n 
represents the number of individuals in either 
group. To obtain an approximately unbiased 
estimate of the population standard deviation, 
extract the square root of the variance. 
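
The variance estimate just described is a one-line computation. The sketch below uses the 232,544 figure and the divisor 2(n - 1)(n) from the text; the variable names are illustrative.

```python
from math import sqrt

n = 25                      # individuals in either order group
# 25 * (sum of X^2) minus the two squared within-order sums of X, as in the text
within_term = 232_544
divisor = 2 * (n - 1) * n   # 2(n - 1)(n); equals 1200 when n = 25

variance_sb = within_term / divisor
sd_sb = sqrt(variance_sb)
print(divisor, round(variance_sb, 2), round(sd_sb, 2))
```

The divisor combines the factor n carried by the within term with the 2(n - 1) within-order degrees of freedom.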

To recapitulate: Counterbalanced testing 
yields a direct estimate of differences between 
forms or tests, since each is affected about 
equally by practice effect, but the counterbal- 
anced design calls for revised methods for com- 
puting coefficients of correlation and standard 
deviations if practice effect is appreciable and 
for a more elaborate significance test. 


More Than Two Tests or Forms 


Three Forms of the Same Test and Three 
Orders 





During the process of equating an old form
(A) of a certain aptitude test with two new forms
(B and C), the writer and several assistants
tested 30 eleventh and twelfth grade boys in a
State orphans’ home in three counterbalanced
orders: ABC, CAB, and BCA. Orders CBA,
BAC, and ACB were not used. Ten boys, six
from the eleventh grade and four from the
twelfth, were assigned at random to each order.
One test was given each day for three consecutive
days under carefully standardized conditions.
Table V shows the analysis of variance of the
90 scores. Individual, form, and day differences
are all significant well beyond the .001
level, while the F ratio of 199.54/345.75 for
order means is not significant.

Some investigators would analyze this data 
by ignoring the ‘‘order’’ classification entirely 
and performing a two-way analysis of variance 
(individuals and forms). Order itself is prob- 
ably of little importance in most well controlled 
studies. However, when practice effect (days) 
is a large source of variation, as it is here, the 
significance test for the differences among form 
TABLE V

ANALYSIS OF VARIANCE OF THREE FORMS OF A TEST ADMINISTERED IN THREE
COUNTERBALANCED ORDERS TO 30 DIFFERENT INDIVIDUALS, ONE FORM
EACH DAY FOR THREE DAYS

Source of Variation                     Sum of Squares    d.f.    Mean Square

Orders                                        399.0889       2         199.54
Individuals within orders                    9335.2333      27         345.75*
Forms                                        4361.6889       2        2180.84*
Days (Practice effect)                       3509.3556       2        1754.68*
Forms x days interaction within
  orders**                                      6.1556       2           3.08
Residual (individuals within
  orders x forms interaction)                2373.4667      54          43.95

Total                                       19984.9889      89

*P < .001

**For discussions of this term see Lindquist (17:278-281) and Edwards (3:324-326; 4).
Lindquist analyzes the forms x days interaction sum of squares into two independent
components, forms x days interaction between orders, which is equivalent to the
‘‘orders’’ effect above, and forms x days interaction within orders. Edwards divides
the orders x forms interaction into days and a ‘‘Latin square error’’ remainder
(4:126); he concludes that latin square error is not likely to be a significant
source of variation (3:326; fn. 10). Grant (9:435-437) calls it ‘‘square uniqueness,’’
saying that it may be tested to determine whether or not ‘‘it has been inflated or
deflated by the unique pattern of confounding which occurred in the interactions of this
particular square’’ (p. 435).

# The residual sums of squares for the three orders, each with 18 d.f., are: ABC,
618.53; CAB, 507.53; and BCA, 1247.40. Bartlett’s test of homogeneity (21:250)
yields a chi-square value of 4.22 with 2 d.f., to which corresponds a P of
approximately .15.
TABLE VI

DISCREPANCIES BETWEEN r’s AND BETWEEN VARIANCES COMPUTED FROM
OVERALL DATA (ORDERS IGNORED) AND FROM POOLED WITHIN-ORDERS
FIGURES FOR 30 TESTEES TO WHOM WERE ADMINISTERED TEST
FORMS A, B, AND C

Statistics                  Overall    Within Orders

r_AB                        .42        .70
r_AC                        .49        .81
r_BC                        .40        .62
Variance of Form:  A        186        179
                   B        200        154
                   C        153        101

TABLE VII

A GRECO-LATIN SQUARE DESIGN, WITH n INDIVIDUALS PER GROUP
(TEST FORMS A, B, AND C PRINTED ON GREEN, WHITE, OR
YELLOW PAPER)

           First       Second      Third
Group      Session     Session     Session

1          AW          CG          BY
2          BG          AY          CW
3          CY          BW          AG








(Vol. 23 
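\[
r_{within} = \frac{25\sum_{1}^{50} XY - \left(\sum_{1}^{25} X \sum_{1}^{25} Y + \sum_{26}^{50} X \sum_{26}^{50} Y\right)}
{\sqrt{\left\{25\sum_{1}^{50} X^{2} - \left[\left(\sum_{1}^{25} X\right)^{2} + \left(\sum_{26}^{50} X\right)^{2}\right]\right\}
\left\{25\sum_{1}^{50} Y^{2} - \left[\left(\sum_{1}^{25} Y\right)^{2} + \left(\sum_{26}^{50} Y\right)^{2}\right]\right\}}}
\]

\[
= \frac{25(551,864) - \left[(2716)(2603) + (2600)(2537)\right]}
{\sqrt{\left\{25(574,768) - \left[(2716)^2 + (2600)^2\right]\right\}\left\{25(534,020) - \left[(2603)^2 + (2537)^2\right]\right\}}}
\]

\[
= \frac{130,752}{\sqrt{(232,544)(138,522)}} = .7285.
\]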


means would lose much power, since the sums
of squares for practice effect and ‘‘forms x days
interaction within orders’’ would be pooled with
the residual sum of squares to form a new error
term which is too large. In Table V the F for
forms is 2180.84/43.95 = 49.6, with 2 and 54
d.f., while the inappropriate ratio is

\[ \frac{2180.84}{(3509.3556 + 6.1556 + 2373.4667)/(2 + 2 + 54)} = \frac{2180.84}{101.53} = 21.5, \]

with 2 and 58 d.f. Of course, in this particular
study either F (49.6 or 21.5) is significant far
beyond the .001 level, but in some situations the
difference would be important.
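
The two competing F ratios for forms can be recomputed from the sums of squares and d.f. printed in Table V. The sketch below is illustrative; the dictionary `table_v` and the helper `mean_square` are names introduced here.

```python
# Sums of squares and d.f. as printed in the three-form analysis of variance.
table_v = {
    "forms": (4361.6889, 2),
    "days": (3509.3556, 2),
    "forms_x_days_within_orders": (6.1556, 2),
    "residual": (2373.4667, 54),
}


def mean_square(*sources):
    """Pooled mean square: summed sums of squares over summed d.f."""
    ss = sum(table_v[s][0] for s in sources)
    df = sum(table_v[s][1] for s in sources)
    return ss / df


ms_forms = mean_square("forms")
# Proper error term: the residual alone (54 d.f.).
f_proper = ms_forms / mean_square("residual")
# Two-way analysis ignoring days: practice effect is pooled into error (58 d.f.).
f_pooled = ms_forms / mean_square("days", "forms_x_days_within_orders", "residual")
print(round(f_proper, 1), round(f_pooled, 1))
```

Pooling the large days sum of squares into the error term more than doubles the error mean square, which is the loss of power the text describes.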

The reader will recall that in the Frandsen- 
Higginson study coefficients of correlation and 
variances differed little, whether computed from 
within orders or overall, since the order means 
were not far apart and practice effect was negli- 
gible. A glance at Table VI will show that this 
is not always true, however, since the r’s and 
variances for the 30 testees show considerable 
systematic discrepancies, the within-orders r’s 
being uniformly higher than the overall figures 
and the within-orders variances lower. 


All Possible Orders 





Grant (9:432-434) discusses an experiment in 
which all six permutations (orders) of three things 
were employed, stating that ‘‘when complete sets 
of permutations are used, all interactions within 
such sets of squares cancel out so that they are 
not confounded with the main effects’’ (p. 434). If 





Form A in the study summarized in Table V had 
not had a shorter time limit than Forms B and 
C, it would have been feasible to use five sub- 
jects for each of the six possible orders and to 
test all 30 individuals in the same room 
with the three forms during a single day. Each 
person could be assigned randomly to a seat in 
order to prevent inter-pair correlation caused 
by friendship pairing. Cheating would be min- 
imized so much that subjects could sit in adja- 
cent seats, rather than only in alternate ones. 
Experimental conditions would be controlled 
better than when each order group was tested in
a separate room. It would even be feasible to
use the 24 permutations of four forms or tests
if enough subjects (at least 48) and a large
enough room were available; two subjects per
order yield 1 x 3 x 24 = 72 d.f. for testing the
‘‘forms’’ and ‘‘sequences’’ effects.
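
Complete sets of orders are easy to enumerate. The following sketch lists every permutation of three and of four forms and reproduces the 72 d.f. count for two subjects per order; the function names are illustrative.

```python
from itertools import permutations
from math import factorial


def all_orders(forms):
    """Every administration order (permutation) of the given forms."""
    return ["".join(p) for p in permutations(forms)]


def within_df(n_forms, subjects_per_order):
    """d.f. for the individuals x sessions residual, pooled over all orders."""
    return (subjects_per_order - 1) * (n_forms - 1) * factorial(n_forms)


three = all_orders("ABC")
four = all_orders("ABCD")
print(three)
print(len(three), len(four), within_df(4, 2))
```

With four forms and two subjects per order the 24 orders require 48 subjects, the minimum mentioned in the text.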


Greco-Latin Crossover Designs 





Walker and Lev (24:377-381) illustrate the 
addition of another superimposed variable to 
form a greco-latin square. This is not possi- 
ble with a 2 x 2 design; at least a 3 x 3 setup is 
required. Suppose that in the three-form equating
study three different hues of paper, green
(G), white (W), and yellow (Y), had been used
to make the design of Table VII, where A, B,
and C represent the three different forms of 
the test. Each letter occurs once in each row, 
once in each column, and once with each other 
letter, assignment being random within the lim- 
its of these restrictions. Preferably, assign- 
ment of individuals to groups would also be ran- 
dom in the types of studies discussed here. The 
degrees of freedom are: sessions, groups, 
TABLE VIII

A FACTORIAL CROSSOVER GRECO-LATIN DESIGN

[Table body not recoverable. Surviving column headings: School Grade or Class; Orders; Individual. Surviving row entries: the orders A1 B2 C3 and B1 A3 C2, listed for each of six blocks.]
TABLE IX

DEGREES OF FREEDOM FOR THE VARIOUS SUMS OF SQUARES FROM TABLE VIII

Total between individuals: 432 - 1 = 431

  Orders:                  36 - 1 = 35
  Sexes:                   2 - 1 = 1
  Grades:                  3 - 1 = 2
  Grade x sex x order:     (3 - 1)(2 - 1)(36 - 1) = 70
  Grade x sex:             (3 - 1)(2 - 1) = 2
  Grade x order:           (3 - 1)(36 - 1) = 70
  Sex x order:             (2 - 1)(36 - 1) = 35
  Individuals within
    orders:                3(2)(36)[(2 - 1)] = 216

Total within individuals: (432 - 1)(3 - 1) + (3 - 1) = 2(432) = 864

  Roman numerals:          3 - 1 = 2
  Latin letters:           3 - 1 = 2
  Arabic numerals:         3 - 1 = 2
  Grade x sex x roman:     (3 - 1)(2 - 1)(3 - 1) = 4
  Grade x sex x latin:     (3 - 1)(2 - 1)(3 - 1) = 4
  Grade x sex x arabic:    (3 - 1)(2 - 1)(3 - 1) = 4
  Grade x roman:           (3 - 1)(3 - 1) = 4
  Sex x roman:             (2 - 1)(3 - 1) = 2
  Grade x latin:           (3 - 1)(3 - 1) = 4
  Sex x latin:             (2 - 1)(3 - 1) = 2
  Grade x arabic:          (3 - 1)(3 - 1) = 4
  Sex x arabic:            (2 - 1)(3 - 1) = 2
  Individuals x roman numerals within orders in grade-sex groups:
                           3(2)(36)[(2 - 1)(3 - 1)] = 432
  Residual order x roman interaction within grade-sex groups:
                           3(2)[(36 - 1)(3 - 1) - (3 - 1) - (3 - 1)] = 396

Total: 3(2)(36)(2)(3) - 1 = 1295
forms, and hues, 2 each; individuals within
groups, 3(n - 1); residual (individuals x sessions
interaction within groups), 3[(n - 1)(3 - 1)];
and total, 9n - 1. The interested reader
is referred to Walker and Lev (24:378-381) for
computational details, which differ from the
latin-square design only in that the extra effect,
hue, uses up the two d.f. which in a replicated
3 x 3 latin square would be assigned to ‘‘latin
square error.’’ Before anyone attempts to design
studies of this type he should be familiar
with latin and greco-latin square designs as
explained by Grant (9), Edwards (3:303-332; 4),
Lindquist (17:258-265), Walker and Lev
(24:373-381), and Archer (1).

There are 36 different rows possible for the
3 x 3 greco-latin square design. (In the types
of studies discussed here these yield only 12
different 3 x 3 greco-latin squares, since the
order of the rows within a particular square is
immaterial.) A completely counterbalanced design
can be obtained with just 36 subjects assigned
randomly to the 36 rows. This gives the
following d.f., using the terminology of Table
VII: days, forms, and hues, 2 each; between
individuals, 35; residual, (35)(2) - 2 - 2 = 66;
total, 107. The basic 36-person design may
be replicated for the two sexes and in, say,
two different school grades, yielding a total of
(36)(3)(2)(2) - 1 = 431 d.f.: 2 each for days,
forms, and hues; 140 for ‘‘between individuals
within groups’’; 1 each for sex and grade; 1 for
the sexes x grades interaction; 2 each for sexes
x days, sexes x forms, sexes x hues, grades x
days, grades x forms, grades x hues, sex x
grades x days, sex x grades x forms, and sex
x grades x hues; and 264 for the residual.⁵ The
confounding influence of ten other interactions
has been counterbalanced out by using the
36-orders design.⁶
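
The counts of 36 rows and 12 squares can be confirmed by brute-force enumeration. The sketch below is illustrative: a row is taken to be one ordering of the three forms paired with one ordering of the three hues, a square is any unordered set of three rows meeting the greco-latin restrictions, and all names are introduced here.

```python
from itertools import combinations, permutations

FORMS, HUES = "ABC", "GWY"

# A row pairs one ordering of the forms with one ordering of the hues: 6 x 6 = 36.
rows = [tuple(zip(f, h)) for f in permutations(FORMS) for h in permutations(HUES)]


def is_greco_latin(square):
    """True if no column repeats a form or hue and all nine form-hue pairs differ."""
    for column in zip(*square):
        if len({f for f, _ in column}) < 3 or len({h for _, h in column}) < 3:
            return False
    return len({cell for row in square for cell in row}) == 9


# Unordered triples of rows, so row order within a square is immaterial.
squares = [s for s in combinations(rows, 3) if is_greco_latin(s)]
print(len(rows), len(squares))

# d.f. when 36 subjects are assigned one to each row (three sessions each).
between_individuals = len(rows) - 1
residual = between_individuals * 2 - 2 - 2
total = 3 * len(rows) - 1
print(between_individuals, residual, total)
```

Every one of the 36 rows belongs to exactly one of the 12 squares, so using all 36 rows is the same as using all 12 squares once.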

A test publisher might employ this kind of 
design with three different types of answer 
sheets for three different test forms at several 
different age levels, in each of which the testees 
are subdivided according to socio-economic 
class. 

Probably in most comparisons of two individual
tests, ‘‘tester’’ should be introduced as an
explicit variable, analogous to ‘‘sex’’ or ‘‘schools.’’
If three tests are utilized, however, three testers
may be made a ‘‘greek’’ variable, like hue
in Table VII. Since tester and test are likely
to interact considerably, it seems desirable that
all 36 possible greco-latin orders be used.

Replication Within Orders 





Obviously, it is possible to use more than 
one individual for each order within each grade- 
sex group and thereby get within-order varia- 
tion among individuals, as outlined in Tables 
VIII and IX. If just two individuals are tested
with each of the 36 possible greco-latin orders
for each sex and within each of three grades,
there will be 2 x 36 x 2 x 3 = 432 subjects
needed.⁷ The between-individuals-within-orders
mean square has 216 d.f. The individuals x ro- 
man numerals interaction within orders has 3 
(2)(36) [(2 - 1)(3 - 1)] = 432 d.f.; the order by 
column residual within each grade-sex group has 
(36 - 1)(3 - 1) - 2 - 2 = 66, which when summed 
over both sexes and all three grades yields 396. 
Presumably this ‘‘greco-latin square error’’ or 
‘‘square uniqueness’’ will not be significant when 
tested against the individuals x roman numerals 
interaction, since all 12 possible greco-latin 
squares are involved in each of the six grade- 
sex groups, so its sum of squares can be com- 
bined with those for individuals x roman numer- 
als interaction to yield a ‘‘within’’ error esti- 
mate with 396 + 432 = 828 d.f. 
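
The degrees of freedom for the replicated factorial design can be tallied in a few lines. The sketch below follows the counts given in the text (3 grades, 2 sexes, 36 orders, 2 individuals per order, 3 sessions); the variable names and dictionary keys are illustrative.

```python
grades, sexes, orders, per_order, sessions = 3, 2, 36, 2, 3
subjects = grades * sexes * orders * per_order

# Between-individuals sources.
between = {
    "orders": orders - 1,
    "sexes": sexes - 1,
    "grades": grades - 1,
    "grade_x_sex": (grades - 1) * (sexes - 1),
    "grade_x_order": (grades - 1) * (orders - 1),
    "sex_x_order": (sexes - 1) * (orders - 1),
    "grade_x_sex_x_order": (grades - 1) * (sexes - 1) * (orders - 1),
    "individuals_within_orders": grades * sexes * orders * (per_order - 1),
}

# The two within-individuals error sources that the text proposes to pool.
groups = grades * sexes
within_error = {
    "individuals_x_sessions": groups * orders * (per_order - 1) * (sessions - 1),
    "order_x_session_residual": groups * ((orders - 1) * (sessions - 1) - 2 - 2),
}
pooled_error_df = sum(within_error.values())

print(subjects, sum(between.values()), pooled_error_df)
```

The between-individuals sources add to 431, one less than the number of subjects, and the pooled ‘‘within’’ error carries 828 d.f.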

This design illustrates replication both by 
using the same square and by employing differ- 
ent ones, thus combining the two types of de- 
signs set forth by Edwards (3, 4) and going be- 
yond the ones that Archer (1) proposes. 


The Krugman Study 





In a counterbalanced comparison of the Re- 
vised Stanford-Binet with the WISC, Krugman, 
Justman, Wrightstone, and Krugman (15) used 
various examiners to test both boys and girls at
ten different age levels, but their analysis took 
into account only age (and not in the manner dis- 
cussed here). A rigorous overall design for 
this type of study seems feasible. It would be 
complex, however, for there are 24 testable 
main effects and interactions in addition to those 
involving the testing order (S-B first vs. WISC 
first). Furthermore, if testers were selected 
randomly from a defined population rather than 
only on the basis of availability, serious mixed- 
model problems might occur. 

There is a limit to the extent that crossover 
designs may be elaborated without causing 
grave difficulties in manipulation and interpre- 
tation, but the writer’s survey of the relevant 
journal literature convinces him that this limit 
has not been approached closely yet by most
psychometricians. In particular, the 2 x 2 latin
and the 3 x 3 completely permuted greek crossover
designs seem to merit considerable use.⁸


Summary 


Psychometricians seldom use crossover 
(switchover) analysis of variance procedures, 
even when they counterbalance tests. The sim- 
plest of these designs, a 2 x 2 comparison of 
the Stanford-Binet and the WISC, is illustrated 
in detail with scores from the Frandsen-Higginson
(6) study. Then a 3 x 3 design, involving
hitherto unpublished scores from three forms
of an aptitude test, is analyzed. Completely
permuted latin and greco-latin designs, with
factorial extensions, are discussed. The writer
concludes that crossover designs should be
quite useful for comparing different tests or 
several forms of a test. He particularly em- 
phasizes the 3 x 36 greco-latin arrangement. 
FOOTNOTES


¹ The writer is indebted to Mr. V. N. Amble
and Professor Oscar Kempthorne for making
many helpful suggestions concerning an earlier
draft of this paper. They are not, of course,
responsible for any errors or ambiguities that
remain.


² Edwards (4) points out that the two residual
terms should be tested for homogeneity before
their sums of squares are combined. Here
F(24, 24) = 62.17/29.30 = 2.12, to which corresponds
a P of approximately 2 x .04 = .08. Similarly,
the two sums of squares for individuals may
be pooled, since F(24, 24) = 7,472.28/5,175.12
= 1.44, while the F needed for significance at
the 10 percent level is 1.98. The cautious
investigator may want to use Tukey’s (22) test
for non-additivity of row and column effects,
also.


³ The standard deviation of the ‘‘S-B first’’ scores
(15.45, shown in Table III) is nearly significantly
larger than the ‘‘WISC first’’ standard
deviation (10.08), P being .06.


⁴ Of course, before a pooled within-groups r is
computed, the individual within-group r’s
should be tested for homogeneity. Here the
difference between .7069 and .7678 yields a
CR of only 0.5 via Fisher’s z-transformation
(5:203-204). If individuals are assigned to the
various order-groups randomly, significant
differences among comparable within-group r’s
seem unlikely, though their possibility should
be recognized.


⁵ This complete design is presented and discussed
in the Appendix at the end of this article.
Also see Archer (1:532-536).


⁶ When all possible permutations are built into
latinized, greco-latinized, or hyper-greco-latinized
designs, the objection to the use of
latin squares in psychology that McNemar (18)
raised is obviated. (For a comment on his
article see Kogan [14:26-27].) In testing
studies, completely counterbalanced designs
are often fairly easy to manage and more
economical than other procedures. However, Mr.
V. N. Amble has pointed out to the writer that
certain procedures other than full permutation
may be at least as effective.


⁷ Two school grades or classes require 288
individuals, each of whom takes three different
tests or forms. These may seem to be large
figures to the independent investigator, but
they are small compared with the numbers
involved in standardization studies by major
test companies.


⁸ In a personal communication to the writer,
Professor Edward E. Cureton points out the
need for a clear discussion concerning the ap- 
plicability of analysis of variance methods to 
those cases where one criterion of classifica- 
tion involves classes of the variable rather 
than classes of the examinees, especially when 
these classes are tests such as the S-B and
the WISC or ‘‘comparable’’ forms of a test.
In part, this is the repeated-measurements
problem dealt with by Leonard S. Kogan 
(‘‘Analysis of Variance—Repeated Measure- 
ments, ’’ Psychological Bulletin, XLV (1948), 
pp. 131-143) and Jack Block and others (‘‘Testing 
for the Existence of Psychometric Patterns,’’ 
Journal of Abnormal and Social Psychology, 
XLVI, 1951, pp. 356-359). Frederic 
M. Lord (The Standard Errors of Various 
Test Statistics When the Test Items are Sampled, 
Educational Testing Service, Princeton, 
New Jersey, December 1953) has examined in 
detail the case where a large number of forms 
of the same test are administered to the same 
group of examinees, each form consisting of 
a sample of items drawn randomly from a 
common pool of items, and has derived six 
large-sample standard-error formulas for this 
situation. However, in the illustrations used 
in the present study the different tests are not 
randomly parallel. Furthermore, random 
sampling of items might yield forms that when 
used with various groups of examinees consist- 
ently give rather different scores, especially 
if the number of items in each form is small. 
These forms could not be used interchange- 
ably without a transformation, even though 
they might be ‘‘comparable’’ in the randomly 
parallel sense. 

‘‘The ‘Equating’ of Non-Parallel Tests’’ is 
discussed by William H. Angoff (SR-54-13, Ed- 
ucational Testing Service, Princeton, New Jer- 
sey, May 1954). 
APPENDIX 


(To Supplement Table III of the Article) 
100 IQ SCORES FOR THE 50 FRANDSEN-HIGGINSON SUBJECTS 





   S-B First, WISC Second          WISC First, S-B Second

Pupil    S-B    WISC FS          Pupil    S-B    WISC FS
No.      IQ     IQ               No.      IQ     IQ

1        132    120              26       118    125
2        121    120              27       118    115
112 28 117 
138 29 113 
125 30 

31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 

43 

44 

45 

46 

47 

48 

49 

50 
(To Supplement Table V of the Article) 
90 SCORES FOR THE 30 INDIVIDUALS TESTED WITH THREE FORMS 
OF AN APTITUDE TEST 

Testing   Individual             Form                  Sums
Order     Number           A        B        C       (A+B+C)

ABC          1                                          182
             2                                          201
             3                                          173
             4                                          192
             5                                          165
             6                                          167
             7                                          147
             8                                          134
             9                                          140
            10                                          112
          Sums           549      521      543        1,613
          Squares                                   267,141

CAB          1                                          202
             2                                          193
             3                                          177
             4                                          154
             5                                          135
             6                                          157
             7                                          139
             8                                          120
             9                                          116
            10                                           87
          Sums           617      503      360        1,480
          Squares     39,671                        230,718

BCA          1            91       50       67          208
             2            80       31       55          166
             3            73       57       52          182
             4            70       35       53          158
             5            66       53       44          163
             6            62       37       28          127
             7            58       22       35          115
             8            56       21       44          121
             9            50       26       46          122
            10            47       23       46          116
          Sums           653      355      470        1,478
          Squares     44,319   14,243   23,140      227,812

Overall Sums           1,819    1,379    1,373        4,571
Overall Squares      115,679   69,197   67,265      725,671

                       Day 1    Day 2    Day 3
Sums for Days          1,264    1,608    1,699

Total Sum of Squares                                252,141
Procedure for obtaining the sums of squares shown in Table V is: 

$$\text{Total} = \tfrac{1}{90}\left[90(252{,}141) - (4571)^2\right] = 19{,}984.9889$$

$$\text{Orders} = \tfrac{1}{90}\left\{3\left[(1613)^2 + (1480)^2 + (1478)^2\right] - (4571)^2\right\} = 399.0889$$

$$\text{Forms} = \tfrac{1}{90}\left\{3\left[(1819)^2 + (1379)^2 + (1373)^2\right] - (4571)^2\right\} = 4361.6889$$

$$\text{Days} = \tfrac{1}{90}\left\{3\left[(1264)^2 + (1608)^2 + (1699)^2\right] - (4571)^2\right\} = 3509.3556$$

Individuals within orders = 

$$\tfrac{1}{30}\left\{10(725{,}671) - \left[(1613)^2 + (1480)^2 + (1478)^2\right]\right\} = 9335.2333$$

Forms x days interaction within orders (latin square error) = 

$$\tfrac{1}{90}\left\{9\left[(549)^2 + (521)^2 + (543)^2 + (617)^2 + (503)^2 + (360)^2 + (653)^2 + (355)^2 + (470)^2\right] - 3\left[(1613)^2 + (1480)^2 + (1478)^2\right] - 3\left[(1819)^2 + (1379)^2 + (1373)^2\right] - 3\left[(1264)^2 + (1608)^2 + (1699)^2\right] + 2(4571)^2\right\} = 6.1556$$

Residual (individuals x forms interaction within orders) = 

$$\tfrac{1}{30}\left\{30(252{,}141) - 10(725{,}671) - 3\left[(549)^2 + (521)^2 + (543)^2 + (617)^2 + (503)^2 + (360)^2 + (653)^2 + (355)^2 + (470)^2\right] + \left[(1613)^2 + (1480)^2 + (1478)^2\right]\right\} = 2373.4667$$
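
The arithmetic of the sums of squares can be rechecked with a short script. The sketch below uses only the totals printed in the appendix; the variable names and the dictionary layout are illustrative choices, not part of the original procedure.

```python
# Totals transcribed from the appendix.
N = 90                     # 30 individuals x 3 forms
grand = 4571               # grand total of all 90 scores
total_sq = 252_141         # sum of the 90 squared scores
indiv_sq = 725_671         # sum of squared individual totals (A+B+C)
orders = [1613, 1480, 1478]
forms = [1819, 1379, 1373]
days = [1264, 1608, 1699]
cells = [549, 521, 543, 617, 503, 360, 653, 355, 470]  # order-by-form sums

def ssq(values):
    """Sum of squared values."""
    return sum(v * v for v in values)

ss = {
    "total": (N * total_sq - grand**2) / 90,
    "orders": (3 * ssq(orders) - grand**2) / 90,
    "forms": (3 * ssq(forms) - grand**2) / 90,
    "days": (3 * ssq(days) - grand**2) / 90,
    "individuals_within_orders": (10 * indiv_sq - ssq(orders)) / 30,
    "latin_square_error": (9 * ssq(cells) - 3 * ssq(orders) - 3 * ssq(forms)
                           - 3 * ssq(days) + 2 * grand**2) / 90,
    "residual": (30 * total_sq - 10 * indiv_sq - 3 * ssq(cells)
                 + ssq(orders)) / 30,
}

for name, value in ss.items():
    print(f"{name:28s}{value:12.4f}")
```

The six components other than the total add up to the total sum of squares, which is the accuracy check the longer method is meant to provide.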


Of course, either one of the last two terms could be secured more simply by subtraction, but the 
longer method provides a check upon the accuracy of the computations. Note also that the ‘‘latin square 
error’’ is the difference between the sum of squares for orders x forms interaction and the ‘‘days’’ effect: 

$$\tfrac{1}{90}\left\{9\left[(549)^2 + (521)^2 + (543)^2 + (617)^2 + (503)^2 + (360)^2 + (653)^2 + (355)^2 + (470)^2\right] - 3\left[(1613)^2 + (1480)^2 + (1478)^2\right] - 3\left[(1819)^2 + (1379)^2 + (1373)^2\right] + (4571)^2\right\} - \tfrac{1}{90}\left\{3\left[(1264)^2 + (1608)^2 + (1699)^2\right] - (4571)^2\right\} = 3515.5111 - 3509.3556 = 6.1556$$


Thus this ‘‘latin square error’’ sum of squares is simply the summation over individuals, orders, 
and forms of the squared residuals of the various order-form means corrected for orders, forms, and 
days: 

$$\sum_i \sum_o \sum_f \left(\bar{X}_{of} - \bar{X}_o - \bar{X}_f - \bar{X}_d + 2\bar{X}\right)^2.$$





JOURNAL OF EXPERIMENTAL EDUCATION (Vol. 23 


MEANS OF THE THREE FORMS, THREE ORDERS, AND THREE DAYS 

                      Form
Order          A        B        C       Mean

ABC          54.90    52.10    54.30    53.77
CAB          61.70    50.30    36.00    49.33
BCA          65.30    35.50    47.00    49.27

Mean         60.63    45.97    45.77    50.79 (Grand Mean)

First Day = 42.13     Second Day = 53.60     Third Day = 56.63 





It is obvious from the above table that the means for Forms B and C differ extremely little (0.20 points), 
while Form A is much easier. The appropriate F-test for the small difference between 45.97 and 45.77 
is: 

$$F_{1,54} = \frac{\tfrac{1}{60}\left\{2\left[(1379)^2 + (1373)^2\right] - (1379 + 1373)^2\right\}}{43.95} = \frac{0.6000}{43.95}$$





The means for the first and second days differ much more than those for the second and third days: 
11.47 versus 3.03 points. To test the significance of the difference between 53.60 and 56.63, use the 
following ratio: 

$$F_{1,54} = \frac{\tfrac{1}{60}\left\{2\left[(1608)^2 + (1699)^2\right] - (1608 + 1699)^2\right\}}{43.95} = \frac{138.02}{43.95} = 3.14, \qquad P > .05$$


If the overall F-test for differences among the three order means had been significant, we could have 
tested the discrepancy between CAB and BCA order means by 

$$F_{1,27} = \frac{\tfrac{1}{60}\left\{2\left[(1480)^2 + (1478)^2\right] - (1480 + 1478)^2\right\}}{345.75}$$

This presupposes that the differences between tests in their residual effects are negligible. If not, 
the differences between the order groups will be exaggerated and the significance test vitiated. See 
Kempthorne (13:Ch. 29). 
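
The three single-degree-of-freedom comparisons above share one form, which the following sketch makes explicit. The function name is hypothetical; the two error mean squares are recomputed here from the appendix sums of squares (2373.4667 with 54 d.f. and 9335.2333 with 27 d.f.), and each total is taken to be based on 30 scores.

```python
def pair_f(sum1, sum2, n_per_sum, error_ms):
    """F ratio (1 d.f. in the numerator) for the difference between two totals."""
    # Sum of squares between two totals, each based on n_per_sum scores.
    ss = (2 * (sum1**2 + sum2**2) - (sum1 + sum2) ** 2) / (2 * n_per_sum)
    return ss / error_ms

residual_ms = 2373.4667 / 54       # individuals x forms within orders
within_orders_ms = 9335.2333 / 27  # individuals within orders

f_forms_b_c = pair_f(1379, 1373, 30, residual_ms)      # Form B vs. Form C
f_days_2_3 = pair_f(1608, 1699, 30, residual_ms)       # Day 2 vs. Day 3
f_orders = pair_f(1480, 1478, 30, within_orders_ms)    # CAB vs. BCA

print(round(f_forms_b_c, 3), round(f_days_2_3, 2), round(f_orders, 4))
```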


Completely Permuted 3 x 3 Greco-Latin Square Designs 





As noted in the article, there are 36 different rows possible for the greco-latin design with three 
levels of each of the variables. Using the capital latin letters A, B, and C to represent the latin vari- 
able and the arabic numerals 1, 2, and 3 to represent the greek variable, we list the 12 greco-latin 
squares that these form (the order of rows within each square is immaterial for the type of problem dealt 
with in this paper): 
Square Number 























By assigning 36 individuals randomly to the 36 rows and analyzing the results without regard to squares, 
one obtains tests of the three main effects free from confounding with the various untested interactions. 
Rows and individuals are completely confounded, but this is probably unimportant because the row factor 
is of no direct interest in test-comparison studies, though it can serve as a check upon the differential 
residual effects of the treatments (see 13:Ch. 29) if we assign at least two subjects randomly to each of 
the 36 rows. This design yields the following d.f.: 2 each for roman numerals, latin letters, and arabic 
numerals; 35 for rows; 36 for individuals within rows; and 72 for individuals x roman numerals interaction 
within rows, to which would be added the (36-1)(3-1)-2-2=66 that in an incompletely permuted 
design would be called ‘‘square uniqueness’’ (Grant) or ‘‘[greco-]latin square error’’ (Edwards) or ‘‘roman 
numerals x latin letters x arabic numerals interaction within rows’’ (Lindquist), giving 138 d.f. for 
the ‘‘within’’ error term used in testing all of the effects except rows, for which the ‘‘individuals within 
rows’’ mean square is appropriate. 
When working with the 3 x 3 greco-latin square design or modifications thereof, it is helpful to keep 
in mind that for a single such square there are no d.f. for the residual, since [(3 x 3) - 1]- 2-2-2-2 
= 0. Thus, as for a 2 x 2 latin square, each effect is completely confounded with the interaction of the 
other effects. When only 36 individuals are tested, we are in effect using as the ‘‘within’’ error term 
three pooled interactions, each with 11 x 2 d.f.: 12 squares x 3 roman numerals; 12 squares x 3 latin 
letters; and 12 squares x 3 arabic numerals. There seems to be no reason for thinking that these inter- 
actions are not unbiased estimates of experimental error if differences among the treatments in their 
residual effects are negligible. 

Testing 36 boys randomly assigned to each of the 36 possible rows and 36 girls assigned likewise 
yields a design somewhat similar to the one set forth by Archer (1:524) in his Table I. His scheme 
calls for three different 4 x 4 squares for the boys and three other 4 x 4 ones for the girls, but the 
method of analysis is identical; see his pp. 523-529, especially p. 527. In the present discussion, an 
added refinement would be to divide the total sum of squares between the 72 individuals into three 
components instead of two: between rows, with 35 d.f.; between sexes, with 1; and the residual (inter- 
action of rows with sex), with 35. This is possible because each boy has a ‘‘mate’’ among the girls who 
follows exactly the same order as he; in this sense there are 36 row means for the 72 testees. The residual 
mean square would be the error term for both rows and sex, but it seems unlikely to the author 
that this triple partitioning will in actual practice be worth the extra effort. 

If 36 boys and 36 girls are tested in each of several school grades, say three different ones, the de- 
sign will be as shown on the following page. 


There are 3(216) - 1 = 647 d.f. Of these, 216 - 1 = 215 are for differences among the over-all means 
of the 216 individuals (differences between individuals), and (216 - 1)(3 - 1) + (3 - 1) = 2(216) = 432 are 
for differences ‘‘within individuals. ’’ 

The between-individuals sum of squares may be partitioned into four independent components, with 
the following d.f., if ‘‘rows’’ are disregarded: grades, 2; sex, 1; grades x sexes interaction, 2; and dif- 
ferences among individuals within the six grade-sex groups, 6(36 - 1) = 210. If the 36 different greco- 
latin rows are taken into account, the 210 d.f., would be further subdivided into 35 for rows, 35 for rows 
x sex interaction, 70 for rows x grades interaction, and 70 for the residual (rows x sexes x grades inter- 
action). 

The 13 components of the ‘‘within’’ variance are secured from roman numerals, latin letters, and arabic 
numerals, their interactions (6 simple and 3 second-order) with grades and/or sex, and the residual 
within grade-sex groups, which has 6[(36 - 1)(3 - 1) - 2 - 2] = 396 d.f. 
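
The degrees-of-freedom bookkeeping for the 216-individual design can be tallied mechanically. The sketch below simply restates the counts given in the text (names are illustrative) and confirms that the between and within partitions add up to 215 and 432.

```python
individuals = 216   # 36 rows x 2 sexes x 3 grades
occasions = 3       # each individual is tested three times
groups = 6          # grade-sex groups

total_df = occasions * individuals - 1
between_df = individuals - 1
within_df = total_df - between_df

between_parts = {
    "grades": 2,
    "sex": 1,
    "grades x sex": 2,
    "individuals within grade-sex groups": groups * (36 - 1),
}

main = 2  # d.f. for each of roman numerals, latin letters, arabic numerals
within_parts = {
    "three main effects": 3 * main,
    "main effects x grades": 3 * main * 2,
    "main effects x sex": 3 * main * 1,
    "main effects x grades x sex": 3 * main * 2 * 1,
    "residual within grade-sex groups": groups * ((36 - 1) * (3 - 1) - 2 - 2),
}

print(total_df, between_df, within_df)
print(sum(between_parts.values()), sum(within_parts.values()))
```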

The analysis of these designs is jeopardized when significant greco-latin order differences or inter- 
actions of orders with other effects occur because of differences among the treatments with respect to 
their residual effects (not just because the individuals differ from row to row). Kempthorne (13) explains 
why this is so in his Chapter 29, especially on pp. 596-597 and 606-607. The design in Tables VIII and 
IX of the present writer’s article includes two persons for each of the 36 greco-latin orders, thus permitting 
significance tests of the ‘‘order’’ effect (differential residual effects) and of the three interactions of 
order with grade and/or sex. 
THE INFLUENCE OF SOCIO-CULTURAL CHARACTERISTICS 
ON EDUCATIONAL OPPORTUNITIES 
IN PUBLIC SCHOOL 
INSTRUMENTAL MUSIC* 


MELVIN L. ZACK 
Los Angeles State College 


Introduction 


THROUGH THE education of youth, an attempt 
is made, in most cultures, to maintain standards 
and foster ideals considered to be of primary 
importance. Inasmuch as equality of opportunity 
has been a cultural ideal of the United States, 
public education has been expected not only to 
expound this ideal, but to become the agency 
through which this ideal could be translated into 
reality. For the most part, both lay and professional 
opinion are in agreement with the concept 
that the abilities and interests of the child should 
be the only limitations placed upon his participation 
in any school-sponsored activity. 

Many studies, however, have indicated that 
differences in children’s backgrounds due to socio- 
economic status, color, religion, nationality 
background, or family size interfere with the attainment 
of equitable educational opportunities in 
public schools.1,2** Most of these studies, however, 
have failed to take into account individual 
differences in the abilities and interests of chil- 
dren. The present study was designed with the 
specific purpose of considering two sets of vari- 
ables in their influence on participation. Not only 
was the influence of socio-cultural variables, 
such as color, to be determined, but also the in- 
fluence of abilities and interests, as well as the 
interaction between these two sets of variables. 

In an investigation concerned with many vari- 
ables, the validity of the results is determined 
largely by the adequacy of the design. It is for 
this reason that the method of procedure is given 
in considerable detail. 


Statement of the Problem 





The problem was first to determine the ability-interest 
differences between participants and 
non-participants in instrumental music; second, 
to determine the socio-cultural differences between 
participants and non-participants; and, 
most important of all, to take both sets of differences 
into account in determining whether or not 
the public schools of Kansas are providing to all 
children equal opportunities for participation in 
instrumental music. 

The need for this investigation arose from a 
consideration of the following conditions and assumptions: 
It has been assumed that instrumental 
music courses are both elective and selective. 
They are elective in the sense that no child is re- 
quired to participate. It has been assumed that 
any child who wishes to participate may do so. 
With respect to the selective character of instru- 
mental music courses, it has been assumed that 
children with appropriate abilities and interests 
tend to enter and remain in such courses, where- 
as children who lack these abilities and interests 
either fail to enter, or enter only to drop out after 
a period of participation. 

Two conditions in Kansas cast suspicion on the 
validity of these assumptions. One condition is 
that a majority of schools furnish at public expense 
only a fraction of the equipment which a 
student must own in order to participate in band 
or orchestra. This condition suggests that individual 
differences in socio-economic status may 
override individual differences in ability and interest 
in influencing participation. The second 
condition is that high prestige is generally associated 
with membership in a school instrumental 
group. This condition suggests that individual 
differences in status due to color, nationality 
background, religious preference, or size of family 
may override individual differences in ability 
and interest in their influence upon participation. 

The possible contradictions between these as- 
sumptions and conditions, as discussed above, 
made it necessary to take into account both socio- 
cultural characteristics as well as appropriate 
abilities and interests in studying participation 
in instrumental music classes. 

What are the abilities and interests which could 
be regarded as legitimate requirements for participation 
in the instrumental music offerings of 
the public schools? A review of the literature on 


* A summary of A Study of the Influence of Selected Socio-cultural Characteristics on Participation 
in Instrumental Music, unpublished doctoral dissertation, University of Kansas. 

** Footnotes will be found at the end of this article. 
music education pointed to four factors: musicality, 
intelligence, the youngster’s interest for 
music, and the support for music in the youngster’s 
home.3,4,5,6 

With the selection of these four variables as 
the abilities and interests which could be regard- 
ed as legitimate limitations upon participation, 
and with the selection of socio-economic status, 
color, religion, nationality background, and family 
size as the socio-cultural characteristics 
which could not be regarded as legitimate limitations, 
the problem of the present study became 
a search for answers to two specific questions: 


1. Are participants in instrumental music 
significantly superior with respect to 
abilities and interests as compared to 
non-participants and as compared to 
drop-outs ? 

2. When ability and interest differences are 
taken into account, do individual differ- 
ences in status due to socio-cultural 
characteristics have a significant asso- 
ciation with participation or with drop- 
out ? 


Selection of the Sample 





The sample had to include youngsters who had 
reached that point in school where, if they were 
not already participating or had not already par- 
ticipated in instrumental organizations, it was 
very unlikely that any future participation under 
high school auspices would take place. This 
criterion simply meant that a part of the sample 
should consist of seniors in their final semester 
of high school study. 

Because it also was desired to include potential 
high school drop-outs in the sample, a part 
of the sample had to include children who were 
at that level in school immediately prior to a 
year of high incidence of general drop-out. Since 
the ninth grade is a year of high general drop-out, 
it was decided the sample should include eighth 
grade students. 

There still remained the problem of selecting 
these samples from the total population of eighth 
grade students and seniors in Kansas schools. 
Inasmuch as color, religion, nationality background, 
and socio-economic status were four of 
the characteristics whose influence upon instru- 
mental participation was being measured, it was 
desired to select schools from communities where 
these characteristics apparently were most varied. 
The rationale for this procedure was obvious. For 
example, in an all-white community it would be 
impossible to ascertain the influence of color upon 
participation. For this reason the most recent 
census data available on racial structure, nationality 
structure, educational structure, and occupational 
structure were utilized. Because census 
data did not include religious preferences of the 
population, it was impossible to derive ratios for 
this characteristic. 

Each city in Kansas was ranked in comparison 
with all other cities in its population bracket with 
respect to the size of the following ratios: 


1. Negro population to total population. 

2. Foreign-born white population to native- 
born white population. 

3. Adult college graduates to all adults. 

4. Adults with no years of formal schooling 
to all adults. 

5. Employed males in the professional level 
occupations to all employed males. 

6. Employed male laborers to all employed 
males. 


The larger the ratio, the higher was that city’s 
rank. These six ranks were averaged and the 
cities then were ranked on the basis of this single 
average. Eight school systems in cities with the 
highest average ranks were invited to participate 
in this study. Two of the eight declined. Another 
school system in a high ranking community, in 
the same population bracket with the two which 
had declined, was selected, and its school offi- 
cials agreed to participate. 
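
The ranking procedure just described can be sketched in a few lines. Everything in the example is hypothetical: the city labels, the ratio values, and the function name are invented for illustration, only two ratios are shown instead of six, and ties are not given any special treatment.

```python
def average_ranks(ratios_by_city):
    """Map each city to the mean of its ranks over all ratios.

    The city with the largest ratio receives the highest rank number.
    """
    cities = list(ratios_by_city)
    n_ratios = len(next(iter(ratios_by_city.values())))
    totals = {city: 0.0 for city in cities}
    for k in range(n_ratios):
        # Sort ascending so the largest ratio gets the largest rank.
        ordered = sorted(cities, key=lambda c: ratios_by_city[c][k])
        for position, city in enumerate(ordered, start=1):
            totals[city] += position
    return {city: totals[city] / n_ratios for city in cities}

# Hypothetical cities in one population bracket, two ratios each.
example = {"X": [0.10, 0.05], "Y": [0.20, 0.01], "Z": [0.05, 0.03]}
mean_ranks = average_ranks(example)
# Cities listed from highest to lowest average rank.
invited_order = sorted(mean_ranks, key=mean_ranks.get, reverse=True)
print(mean_ranks, invited_order)
```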

This sampling procedure resulted in a total 
sample of 960 students, of whom 538 were eighth 
graders and 422 were seniors. 


Devices for Securing Data 





Intelligence Test Information 


In order to help secure the cooperation of 
schools selected for the study it was decided to 
administer the Otis Quick-Scoring Tests of Men- 
tal Ability, Gamma and Beta, to seniors and 
eighth graders, respectively, in those school sys- 
tems which desired this service. In other schools 
the record of any standard group intelligence test 
results was utilized. Within each school and at 
each grade-level, IQ’s were converted to Z-scores 
with a mean of 50 and a standard deviation 
of 10. 

This procedure was followed in order that chil- 
dren measured by different tests of intelligence 
might be compared with each other on the basis 
of their deviation from the mean of their own 
group. Z-scores were also needed in order to derive 
a composite score based on the means and 
standard deviations of the following four measures; 
intelligence, musicality, interest for music, and 
home support for music. 
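
A minimal sketch of the conversion to scores with a mean of 50 and a standard deviation of 10 is given here. The function name and the sample IQ values are hypothetical, and the use of the population standard deviation within each school-grade group is an assumption; the text does not say which form was used.

```python
import statistics

def to_standard_scores(scores, mean=50.0, sd=10.0):
    """Rescale one group's scores to the given mean and standard deviation."""
    m = statistics.fmean(scores)
    s = statistics.pstdev(scores)  # assumption: population standard deviation
    return [mean + sd * (x - m) / s for x in scores]

# Hypothetical IQ's for one school at one grade level.
iqs = [90, 100, 110, 120]
z_scores = to_standard_scores(iqs)
print([round(z, 1) for z in z_scores])
```

Because each group is rescaled around its own mean, pupils measured by different intelligence tests are compared only by their standing within their own group.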





Musicality 


Measurement of musical ability was obtained 
by administering A Test of Musicality to all subjects. 
The items on pages two and three of this 
test are designed to measure ‘‘the excellence of 
the ability of the individual to apprehend musical 


structures.’’* According to one authority these 
are the kinds of items which truly measure musical 
aptitude or musical ability.’ This same 
authority maintains that two widely used tests 
in this area do not measure musical abilities 
but rather measure acoustical or sensory abil- 
ities. 


Interest for Music 


Though most experienced music educators 
would not be opposed to the proposition that a 
youngster’s attitude toward music, or interest 
for music, is highly indicative of the degree of 
use to which he puts his musical ability, only 
one test which measured this important predic- 
tive factor could be found. This was the first 
section, i.e. , page one, of A Test of Musicality. 

Because norms for this section of the test, as 
well as for the musicality section, are present- 
ed both by grade levels and by sex, it was nec- 
essary first to convert all raw scores to the 
percentile ranks appropriate to the grade place- 
ment and sex of the youngsters taking the tests 
and then to convert these percentile ranks to Z- 
scores. This procedure made it possible to 
attain comparability of boys’ and girls’ scores 
as well as comparability of eighth graders’ and 
seniors’ scores. It also was needed for the composite 
scores mentioned previously. 
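
The text does not state how percentile ranks were turned into Z-scores. One common route, shown here purely as an assumption, is to take the normal deviate corresponding to the percentile and rescale it to a mean of 50 and a standard deviation of 10. The function name and the clamping of extreme percentiles to the range 0.5 to 99.5 are illustrative choices.

```python
from statistics import NormalDist

def percentile_to_z_score(percentile_rank, mean=50.0, sd=10.0):
    """Convert a percentile rank (0-100) to a normalized standard score."""
    # Clamp so that ranks of 0 or 100 do not map to infinite deviates.
    p = min(max(percentile_rank, 0.5), 99.5) / 100.0
    return mean + sd * NormalDist().inv_cdf(p)

for pr in (16, 50, 84):
    print(pr, round(percentile_to_z_score(pr), 1))
```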





Home Support for Music 


Hardly any educator would deny that a powerful 
influence is exerted by the attitudes of parents 
upon the behavior of their children. In spite 
of this fact, there was found no instrument which 
attempted to measure parental attitudes toward 
music. For this reason an attempt was made to 
construct such an instrument. Extreme care 
was taken to insure that the questionnaire measured 
what it purported to measure, and did so 
consistently. Two different groups of experienced 
music educators examined each item in an 
effort to assess its validity and discriminating 
power. Statistical estimates of validity were 
determined by contrasting scores with teachers’ 
dichotomous categorizations of the homes. All 
t-tests were significant at the .01 level. The reliability 
of the questionnaire was determined by 
dividing it into two roughly equivalent halves, 
scoring each half, and then computing the coefficient 
of correlation between the scores on these 
two sections. Inasmuch as four sets of norms 
had been necessary to handle both sex and grade 
differences, the correlations were performed on 
four randomly selected groups, one from within 
each norm group. The coefficients ranged from 
.80 to .89, and all were significant at the .01 
level. 
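
The split-half reliability estimate described above amounts to correlating two half-scores for each respondent. In the sketch below the odd-even split, the function names, and the item responses are all assumptions made for illustration; the text says only that the halves were roughly equivalent.

```python
import math

def pearson_r(x, y):
    """Product-moment coefficient of correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_r(item_scores):
    """Correlate odd-item and even-item half scores (assumed split)."""
    first_half = [sum(items[0::2]) for items in item_scores]
    second_half = [sum(items[1::2]) for items in item_scores]
    return pearson_r(first_half, second_half)

# Hypothetical item responses for three questionnaires of four items each.
responses = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 4]]
r_halves = split_half_r(responses)
print(round(r_halves, 3))
```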


Socio-cultural Characteristics 


Information concerning the occupation of the 
head of the household, color, religion, nationality 
background, number of children in the home, 
and music courses taken was obtained by comple- 
tion of a questionnaire by the subjects. Wherever 
school records permitted it, the student report 
was validated against school data. 

The occupational scale developed by Warner, 
Meeker, and Eells was utilized as a measure of 
socio-economic status. Protestant sects were 
categorized in accordance with several articles 
in a recent encyclopedia of religion. Nationality 
background was categorized for white subjects 
only. Grandparents’ place of birth was the 
arbitrary means employed to place a white sub- 
ject in ‘‘native’’, Anglo-Saxon, Latin, or Slavic 
categories. 


Normality of Data 





In order to determine whether the intelligence 
Z-scores, musicality Z-scores, interest for music 
Z-scores, and home support Z-scores departed 
from normality, chi-square tests of significance 
were applied to these distributions. Since the assumption 
that these four variables are distributed 
normally in a school population had been made, it 
was decided to transform into normalized T-scores 
those Z-scores which did depart significantly 
from normality. This procedure was necessitated 
by subsequent t-tests which are based 
on such an assumption. 


Composite Scores 





For each individual in the study it seemed ad- 
vantageous to obtain a single composite score 
representing the individual’s intelligence, music- 
ality, interest for music, and home support for 
music. In deriving a composite score, consider- 
ation was given first to discriminant function 
weightings, i.e., weightings for each of these 
four variables which would maximize the differ- 
ences between participants and non-participants. 
However, the use of composite scores based on 
discriminant function weightings, in further sta- 
tistical comparisons which hold constant com- 
posite score, would bias the outcomes of such 
comparisons and, therefore, was rejected. The 
arithmetic mean of each individual’s four T- 
scores was the composite score utilized. 

Although each variable could have been con- 
sidered separately, the interdependence of these 
four variables in discriminating between partic- 
ipants and non-participants, as shown by prior 
research, made more valid the use of the composite 
score. The awkwardness of simultaneously 
holding constant four different variables 
was another reason for using the composite. 


Statistical Tools Used 





At each grade level in each community t- 
tests were calculated to test the hypothesis that 
there was no difference between participants 
and non-participants with regard to means on 
composite scores. The assumption which had 
to be satisfied before the t-test could be applied 
was that there was no difference between stand- 
ard deviations of the two groups. This assump- 
tion of homogeneity of variances was tested by 
application of the F-test. In the event that this 
assumption was not satisfied, the Behrens- 
Fisher or d-test was used in place of the t-test. 
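
The sequence described in this paragraph, a variance-ratio check followed by a comparison of means, can be sketched as below. The function name and the sample scores are hypothetical, and only the test statistics are computed; significance would still be judged against the appropriate tables. The unpooled ratio is returned as well, since it is the form of statistic referred to the Behrens-Fisher tables when the variances differ.

```python
import math
import statistics

def compare_groups(a, b):
    """Return the variance ratio F, pooled-variance t, its d.f., and unpooled d."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    diff = statistics.fmean(a) - statistics.fmean(b)
    f_ratio = max(va, vb) / min(va, vb)           # larger variance on top
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = diff / math.sqrt(pooled * (1 / na + 1 / nb))
    d = diff / math.sqrt(va / na + vb / nb)       # separate-variance statistic
    return {"F": f_ratio, "t": t, "df": na + nb - 2, "d": d}

# Hypothetical composite scores for participants and non-participants.
participants = [52, 55, 58, 61]
non_participants = [45, 48, 51, 54]
result = compare_groups(participants, non_participants)
print({k: round(v, 3) for k, v in result.items()})
```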

At each grade level the procedure explained 
above was utilized also in a comparison between 
participants and drop-outs with regard to means 
on composite scores. 

Analysis of variance and covariance is the 
technique usually used in cases involving comparisons 
between groups on one variable, e.g., 
socio-economic status, holding another variable, 
e.g., composite score, constant. Because distributions 
of socio-economic status, family size, 
color, religion, and nationality background could 
not satisfy the assumptions of normality and continuity, 
analysis of variance and covariance could 
not be utilized. Instead, analysis of variation by 
the method of ranks was used.12 This technique 
is applicable to data of a qualitative character 
and to data which are not normally distributed. 

Inasmuch as this statistical tool, analysis of 
variation by the method of ranks, is more sensi- 
tive when individual cell entries are numerically 
large, pooling of schools was utilized. 
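
The statistic used in analysis of variation by the method of ranks can be sketched as follows, assuming the usual form in which each of n rows is ranked across p columns and the column mean ranks are compared with their expected value (p+1)/2. The function name and the small table of ranks are hypothetical.

```python
def ranks_chi_square(rank_rows):
    """Chi-square-r and its d.f. for a table of within-row ranks.

    rank_rows: one list per row (e.g. composite-score interval), holding
    the ranks 1..p assigned to the p columns (e.g. status categories).
    """
    n = len(rank_rows)
    p = len(rank_rows[0])
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(p)]
    expected = (p + 1) / 2                       # theoretical mean rank
    sum_d2 = sum((m - expected) ** 2 for m in mean_ranks)
    chi_r = 12 * n / (p * (p + 1)) * sum_d2
    return chi_r, p - 1

# Hypothetical ranks: two rows that agree perfectly over three columns.
chi_r, df = ranks_chi_square([[1, 2, 3], [1, 2, 3]])
print(chi_r, df)
```

With five rows, seven columns, and a sum of squared deviations of 6.9, the same expression gives 12(5)/(7)(8) x 6.9, or about 7.4 with 6 d.f.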

For comparisons involving socio-economic 
status, only those schools with homogeneous 
distributions of socio-economic categories were 
pooled. This was necessary because significantly 
different distributions of socio-economic 
categories between schools might obscure the 
association between participation and socio-economic 
status within either school, should such 
differing schools be pooled. Chi-square was the 
statistical tool applied to test the hypothesis that 
there was no difference in distributions of socio-economic 
categories between schools. When this 
hypothesis was accepted, pooling was accomplished. 

Table I illustrates the application of analysis 
of variation by the method of ranks to a group of 
eighth graders homogeneous with respect to 
socio-economic status. The results of this test 
of significance are reported also in Table III, a 
summary table of similar tests of significance. 

In considering the association between color 
and participation it was decided to equate the 
Negro students in the study (N=24) with white 
students of equal composite score, of equal socio- 
economic status, and of the same grade level. 
Chi-square was the statistical tool applied to this 
equated group to test the hypothesis that there was 
no difference in the proportions of Negro and white 
participants. 

In some schools, t-tests between the means of 
composite scores of participants and non-participants 
resulted in findings of no significant differences 
between these groups. It was not necessary, 
then, to hold constant composite scores 
when determining whether or not these two groups 
differed significantly with regard to religion, 
color, socio-economic status, family size, or nationality 
background. In these situations the statistical 
tistical tool applied was the significance of the 
difference between percentages. 


Findings 


Reference to Table II shows that in eleven of 
the fifteen comparisons made, participants were 
superior to non-participants with respect to com- 
posite score at the .01 level of significance. One 
comparison was significant at the . 02 level; two 
comparisons were significant at the .05 level; 
one comparison failed to reach the . 05 level of 
significance. 

It was decided to hold constant composite 
score, representing interests and abilities, in 
the eleven comparisons significant at the .01 
level and in the comparison significant at the . 02 
level when determining the association between 
various status characteristics and participation 
in instrumental music activities. The results of 
the application of analysis of variation by the 
method of ranks to these twelve groups, reported 
in Table III, indicate that, with composite score 
held constant, there was found no significant as- 
sociation between participation and each of the 
following characteristics: socio-economic status, 
religion, family size, and nationality background. 

In dealing with the three comparisons which 
failed to reach the . 02 level of significance with 
respect to differences between participants and 
non-participants in composite score, composite 
score was not held constant when determining 
the association between participation and each of 
the four characteristics mentioned above. The 
results of the application of the significance of 
the difference between percentages to the data in 
these three schools are reported in Table IV. 
These results show that for the two senior groups 
there was found no significant association be- 
tween participation and the four status character- 
istics. In the eighth grade group, however, the 
association between socio-economic status and 
participation as well as one of the comparisons 
dealing with the association between nationality 
background and participation were significant at 
TABLE I 

RANKS OF PERCENTAGE OF PARTICIPATION IN INSTRUMENTAL 
MUSIC FOR SPECIFIED LEVELS OF COMPOSITE SCORE AND 
OF SOCIO-ECONOMIC STATUS AMONG EIGHTH GRADERS 
IN TOWNS A, C, AND D 








Composite 
score 
intervals 


Ranks based on percentage of participation in instrumental 
music by socio-economic status 


vil VI IV Ill 








65+ 
55-64 
45-54 
35-44 
under 35 


Sum of ranks 
Mean rank 
Deviation 


20 
4 


6 
6 0.0 


Theoretical m 
Sum of deviati 


cn = 1/2(p+1) = 4.0 
mS squared = 6.9 
"SW M.. 


2 
— (2d ) 
d p(p+!) 

2 
Xr — 7.4 


d. f. =6 .30>P>.20 

TABLE II

COMPARISONS BETWEEN PARTICIPANTS AND NON-PARTICIPANTS
WITH RESPECT TO COMPOSITE SCORES

                    Mean of          Mean of non-
                    participants     participants

Eighth Grade
                    53.64            48.72
                    55.58            48.51
                    --.07            47.98
                    --.25            47.68
                    --.38            47.46
                    --.31            45.61
                    --.67            50.4
                    --.52            53.5

Twelfth Grade
                    55.88            48.86
                    56.79            48.59
                    55.7             49.59
                    53.8             46.88
                    55.96            45.32
                    55.5             47.93
                    55.53            50.32

(-- marks digits lost in the source. The town labels and the t values
are not legibly preserved. Entries of the remaining column, as read:
eighth grade 24, 46; 11, 48; 29, 64; 115, 23; 38, 15; 22, 12; 24, 2; 22, 5;
twelfth grade 83, 16; 13, 26; 9, 82; 51, 4; 18, 23; 28, 3; 14, 33.)

* Significant at the .05 level
** Significant at the .02 level
*** Significant at the .01 level

† Since the assumption of homogeneity of variances was not met, the Behrens-
Fisher d-test was used to test the significance of the difference in means.
The obtained value (d = 5.8) is significant at the .01 level.

TABLE III

ANALYSES OF THE ASSOCIATION BETWEEN PARTICIPATION AND VARIOUS STATUS
CHARACTERISTICS WITH ABILITIES AND INTERESTS HELD CONSTANT

Comparison                                        d.f.   χ²r    Probability

Socio-economic status
  (seniors in towns A and C)                       4     2.00   .80>P>.70
Socio-economic status
  (seniors in towns B, E, and G)                   4     4.2    .50>P>.30
Socio-economic status
  (eighth graders in towns A, C, and D)*           6     7.40   .30>P>.20
Socio-economic status
  (eighth graders in towns B, E, and F)            6     4.29   .70>P>.50
Family size
  (seniors in towns A, B, C, E, and G
  plus all eighth graders except in town G)        4     3.55   .50>P>.30
Religion
  (seniors in towns A, B, C, E, and G
  plus all eighth graders except in town G)        5     8.29   .20>P>.10
Nationality background
  (seniors in towns A, B, C, E, and G
  plus all eighth graders except in town G)        3     1.16   .80>P>.70

* A detailed presentation of analysis of variation by the method of ranks, dealing with
this comparison, is reported in Table I.

the .02 level.
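
The probabilities in Table IV behave like two-tailed normal probabilities of a critical ratio, which suggests the following sketch. It is an illustration under stated assumptions, not the author's computation: the pooled-proportion form of the standard error is assumed, the function names are hypothetical, and the counts passed to `critical_ratio` are invented for the example. Only the ratios 1.74 and 2.34 come from Table IV.

```python
import math


def critical_ratio(k1, n1, k2, n2):
    """Difference between two percentages divided by its standard error.

    k1 of n1 and k2 of n2 are the participants in two groups. A pooled
    proportion is assumed for the standard error.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error


def two_tailed_probability(ratio):
    """Two-tailed normal probability of a critical ratio at least this large."""
    return math.erfc(abs(ratio) / math.sqrt(2))


# Ratios reported in Table IV for socio-economic status.
p_seniors = two_tailed_probability(1.74)
p_eighth = two_tailed_probability(2.34)

# Hypothetical counts, for illustration only.
example_ratio = critical_ratio(30, 60, 20, 80)
```

Rounded to two places, the two probabilities reproduce the tabled values .08 and .02.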

In eight of the fifteen sample schools there
were found to be Negro youngsters. In order
to determine whether or not these twenty-four
Negroes were participating in instrumental
music to as great an extent as white youngsters
of equal abilities and interests, it was necessary
to select from the eight sample schools concerned
those white students whose composite
scores were of equal value to that of some Negro
student. In order to remove the influence of
socio-economic status upon the outcome of a test of
significance, it was decided that equivalent
socio-economic status would be another criterion for
equating the groups. Chi-square was the statistical
technique used to ascertain whether or not
there was a significant difference between the
proportion of participants to non-participants
among the Negro youngsters and this proportion
in a matched group of ninety-five white youngsters.
The data and test of significance are reported
in Table V. The results indicate that
there was no significant association between
color and participation in instrumental music
where grade level, composite score, and
socio-economic status were controlled.
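
The chi-square for a fourfold table such as Table V can be checked directly from the four cell counts. The sketch below uses the uncorrected formula printed with that table and the counts a = 7, b = 15, c = 14, d = 81 given there; the function name is hypothetical.

```python
def chi_square_fourfold(a, b, c, d):
    """Chi-square (1 d.f., no continuity correction) for the table [[a, b], [c, d]]."""
    total = a + b + c + d
    numerator = (a * d - b * c) ** 2 * total
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Cell counts from Table V: Negro participants and non-participants (a, b),
# white participants and non-participants (c, d).
table_5_value = chi_square_fourfold(7, 15, 14, 81)
```

The result rounds to 3.54, the value reported in Table V.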

In the comparisons dealing with differences in
mean composite score between eighth grade participants
and eighth grade drop-outs, as well as
mean composite score differences between senior
participants and senior drop-outs, the obtained
values of t were significant at the .01 level.
In each case, participants were superior to drop-outs
with respect to the composite of abilities
and interests. These findings indicated that composite
score would have to be controlled in all of
the comparisons involving participants and drop-outs,
that is, the comparisons dealing with the association
between drop-out and each of the following
characteristics: socio-economic status, family size,
religion, and nationality background. The results
of these comparisons, reported in Table VI, indicate
that there was found no significant association
between drop-out and each of these status
characteristics.

Because there were only ten Negro students 
in the total group of drop-outs and participants, 
it was deemed impractical to use a statistical 
test of significance in determining whether or not 
there was an association between color and drop-out.
Inspection of the data, however, seemed to
suggest that color was not associated with drop-out.


Interpretations and Implications 





Abilities and interests deemed appropriate to 
participation in instrumental music classes were 
found to differentiate participants from non-participants
in twelve of the fifteen schools and to
differentiate participants from drop-outs in all
schools. Though this finding constitutes a necessary
step in the process of determining whether
or not the public schools of Kansas are providing
to all children equal opportunities for participation
in instrumental music classes, it does
not, taken by itself, provide an answer to this
problem.

The purpose of ascertaining whether or not 
participants differed significantly from non-par- 
ticipants and from drop-outs with regard to in- 
telligence, musicality, interest for music, and 
home support for music, those characteristics 
of youngsters which may be regarded as justifi- 
able limitations upon participation, was to util- 
ize valid procedures in attacking the crucial 
question in this study. This question was as follows:
do individual differences in status due to
color, religion, socio-economic status, nation- 
ality background, or family size have a signifi- 
cant association with participation or with drop- 
out? 

In schools in which participants did not differ
significantly from non-participants with respect
to the composite of abilities and interests, the
manner of answering this crucial question was to
ascertain whether or not children of differing
religious preference, differing socio-economic level,
differing color, differing nationality background,
and differing size of family were represented
in instrumental music organizations in
proportion to their numbers in the total school
enrollment. This simple procedure could not
be used in schools in which participants did show
a significant superiority in abilities and interests
over non-participants or over drop-outs. In
these latter cases, it was necessary to consider
different levels of ability and interest, i.e., to
hold constant the composite of interests and abilities,
in ascertaining whether or not children of
varying socio-cultural characteristics were
represented in instrumental music organizations
in proportion to their numbers in the different
ability-interest levels.

The former technique was used in dealing with 
three of the sample schools; the latter technique, 
with twelve of the sample schools. With the ex- 
ception of one eighth grade group, for which 
judgment must be suspended, the answers to the 
crucial question in this study were the same. 

Individual differences in status due to color, 
socio-economic level, religion, nationality back- 
ground, and size of family were found to have no 
significant association with participation in
instrumental music. In all schools, individual
differences in these socio-cultural characteris- 
tics were found to have no significant associa- 
tion with drop-out from instrumental organiza- 
tions. 

These findings are encouraging to the student 
of human affairs in that the evidence revealed 
that an elective and selective area of public 

TABLE IV

ANALYSES OF THE ASSOCIATION BETWEEN PARTICIPATION AND VARIOUS
STATUS CHARACTERISTICS IN SCHOOLS IN WHICH PARTICIPANTS AND
NON-PARTICIPANTS DID NOT DIFFER IN ABILITIES AND INTERESTS

Comparison                                   x/σ     Probability

Socio-economic status (seniors)              1.74       .08
Socio-economic status (eighth graders)       2.34       .02
Family size (seniors)                         .32       .75
Family size (eighth graders)                  .48       .63
Religion (seniors)                            .82       .41
Religion (eighth graders)                     .54       .59
Nationality background (seniors)              .14       .89
Nationality background (eighth graders)
  Native vs. Anglo-Saxon                     2.33       .02
  Native vs. Slavic                          1.39       .16
  Slavic vs. Anglo-Saxon                     1.12       .26





TABLE V

PARTICIPANTS AND NON-PARTICIPANTS WITHIN A GROUP OF WHITE AND NEGRO
YOUNGSTERS EQUATED FOR ABILITIES, INTERESTS, SOCIO-ECONOMIC
STATUS, AND GRADE LEVEL

Color      Participants    Non-participants    Totals

Negro         7 (a)            15 (b)            22*
White        14 (c)            81 (d)            95*
Totals       21                96               117

$$\chi^2 = \frac{(ad - bc)^2\,(a+b+c+d)}{(a+b)(c+d)(a+c)(b+d)} = 3.54$$

d.f. = 1     .10 > P > .05

* Although it was possible to find several white pupils, each of whom could be matched
to the same Negro pupil, it was impossible to find any white students who could be
a match for two of the Negro youngsters.

TABLE VI

ANALYSES OF THE ASSOCIATION BETWEEN DROP-OUT AND VARIOUS STATUS
CHARACTERISTICS WITH ABILITIES AND INTERESTS HELD CONSTANT

Comparison                                        d.f.   χ²r    Probability

Socio-economic status (all seniors)                5     3.68   .70>P>.50
Socio-economic status
  (eighth graders in towns B, D, and G)            4     5.8    .30>P>.20
Socio-economic status
  (eighth graders in towns A, C, E, and F)         4     4.4    .50>P>.30
Family size (all eighth graders)                   4     1.77   .80>P>.70
Family size (all seniors)                          3      .99   .90>P>.80
Religion
  (all eighth graders plus all seniors)            4     4.31   .50>P>.30
Nationality background
  (all eighth graders)                             2     1.31   .70>P>.50
Nationality background
  (all seniors)                                    2     1.36   .70>P>.50

school instruction was providing children with
equal opportunities for participation. These results
are in diametric contradiction to much of
the present published data regarding student participation
in public school activities. This fact
suggests that additional checks of the validity of
this study should be made.

The interests and abilities selected as appropriate
to participation in musical activities may
not be the best choices possible. Kansas communities
may be too homogeneous to test adequately
the issues raised in this study. Since the
statistical technique of analysis of variation by
the method of ranks is more sensitive to large
individual cell entries, a larger sampling might
be made.

The contradiction between the results of this
study and much past research is due also to the
philosophical basis and methodology of the present
study. Past research has failed, for the most
part, to take into account individual differences in
the abilities and interests of children for the public
school activity being investigated. Having
made such an omission, investigators have regarded
differences in socio-economic status or in some
other socio-cultural characteristic between
participants and non-participants, when discovered,
as proof of unequal opportunities for participation.
This is the same, it seems, as stating
that equality of opportunity in a democracy is the
equivalent of identical participation.

According to the philosophical position taken 
in the present study, such a viewpoint is invalid. 
In this study, equality of opportunity was interpreted
to mean the full development by the school
of each individual’s abilities and interests without
regard to color, religion, and similar socio-cultural
characteristics.

Analysis of variation by the method of ranks 
was well suited to the central problem of this 
study. By use of this statistical tool, it was pos- 
sible to determine the rate or percentage of par- 
ticipation in each of several different levels of 
interests and abilities for each of several cate- 
gories of socio-economic status, religion, and 
other characteristics. With data organized in 
this fashion, it was possible to determine wheth- 
er interests and abilities were the more influen- 
tial characteristics associated with participation 
in music or whether socio-economic status, for 
example, was more influential. 

It is conceivable that the philosophical position
and statistical methodology employed in the
present study might prove to be useful in areas
other than participation in instrumental music.



SIMULTANEOUS ABSOLUTE SCALING FOR
SEVERAL GROUPS*


W. A. GIBSON 
Center for Advanced Study in the Behavioral Sciences 
Stanford, California 


THE METHOD of absolute scaling was 
developed some years ago (ref. 1) as one solu- 
tion to the problem of a mental unit of measure- 
ment. The method defines the psychological 
scale of measurement to be such that the distri- 
butions of ability for several age groups on that 
scale are normal in form. A check on whether
such a definition is possible for a given set of 
data is made available, and a way of determin- 
ing the parameters of the resulting scale is pro- 
vided. The present paper is concerned with an 
alternative way of determining these same par- 
ameters. 

The logic of absolute scaling is not difficult.
If a scale of measurement exists such that the
distributions for several age groups on it are
normal, then the location of any point on that
scale can be expressed very simply in terms of
its position in any one of these distributions.
Let the point $X_i$ have normal deviates $Z_{i1}$ in age
group 1, $Z_{i2}$ in age group 2, etc. Then

$$M_1 + \sigma_1 Z_{i1} = M_2 + \sigma_2 Z_{i2} = \dots = M_r + \sigma_r Z_{ir} = \dots = M_n + \sigma_n Z_{in}, \tag{1}$$

where $M_r$ and $\sigma_r$ are the mean and standard deviation
of the distribution for age group r, and
n is the number of age groups in a given problem.
If the point $X_i$ represents the scale position
of a raw score, then its $Z_{ir}$ are the normalized
standard scores for that raw score in
each of the age groups. If $X_i$ is the difficulty
value of a test item, then its $Z_{ir}$ are the normal
deviates associated with the proportions of the
successive age groups who fail that item. Absolute
scaling can be applied either to the percentile
ranks of raw scores or to item difficulty data.
Equations (1) define a straight line in an
n-space in which $Z_{i1}, Z_{i2}, \dots, Z_{ir}, \dots, Z_{in}$ are
the rectangular coordinates of a point on the line.
The equations then state, in effect, that the
n-dimensional graph of the normalized standard
scores or normal deviates for the n age groups
must be linear. Being empirically determined,
the plotted points will not lie exactly in a straight
line, but they must come close to it before an
absolute scale can be fitted with negligible error.


*This research was 





The original absolute scaling method (ref. 1)
involves the plots of the normal deviates for all
pairs of adjacent age groups. These plots are
used to determine the parameters of the absolute
scale. Equations (1) are solved explicitly
for $Z_{i1}$ in terms of $Z_{i2}$ to give

$$Z_{i1} = \frac{\sigma_2}{\sigma_1} Z_{i2} + \frac{M_2 - M_1}{\sigma_1}. \tag{2}$$

A straight line is fitted to the plotted points
on the $Z_{i1}$ - $Z_{i2}$ graph, and its equation is written
in slope-intercept form. The slope then
represents the ratio of the two standard deviations,
while the intercept gives the difference
between the two means in units of $\sigma_1$. The process
then shifts to the $Z_{i2}$ - $Z_{i3}$ plot, then to
the $Z_{i3}$ - $Z_{i4}$ plot, and so on up to the $Z_{i(n-1)}$
- $Z_{in}$ plot. Thus the scale is constructed
step-by-step, each new fitted line providing the mean
and standard deviation for one more age group.
The process requires n-1 line fittings. The
method to be described here yields the parameters
for the n age groups simultaneously by fitting
the line only once in n-space. A disadvantage
of the present method is that it is somewhat
wasteful of the available data, but this can be
overcome, in large measure, in ways that will
be discussed later.
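
One step of the original pairwise procedure can be sketched as follows. The example is illustrative only: the scale points, the parameters M = 0.5 and sigma = 1.25 for the second group, and the use of ordinary least squares for the line fitting are all assumptions made for the sketch, and the function name is hypothetical. The source does not say how the line is fitted.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = slope * x + intercept (an assumed fitting rule)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical absolute scale: group 1 has M = 0 and sigma = 1,
# group 2 has M = 0.5 and sigma = 1.25.
scale_points = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
z1 = [(x - 0.0) / 1.0 for x in scale_points]
z2 = [(x - 0.5) / 1.25 for x in scale_points]

# Equation (2): Z_i1 = (sigma_2 / sigma_1) * Z_i2 + (M_2 - M_1) / sigma_1,
# so the slope and intercept of the Z_i1-on-Z_i2 line give sigma_2 and M_2
# once M_1 = 0 and sigma_1 = 1 are fixed.
slope, intercept = fit_line(z2, z1)
sigma_2 = slope * 1.0
m_2 = 0.0 + intercept * 1.0
```

Repeating this step for each adjacent pair of groups gives the n-1 line fittings mentioned above.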

For the purpose of fitting a line in n-space 
it is convenient to borrow some concepts from 
the n-dimensional vector geometry that has 
been so useful in multiple-factor analysis 
(ref.2). There the factor loadings of tests are 
the projections of vectors on reference axes. 
The vectors of factor analysis all radiate out 
from a common point—the vertex of the vector 
configuration, and the vertex happens to be lo- 
cated at the origin of the coordinate system. In 
the present context each raw score or test item
may be represented by a vector, the coordinates
of whose terminus are the corresponding
$Z_{ir}$. Before we can talk about the vector configuration,
however, we must locate its vertex
in some relevant way. The configuration will
differ, depending upon where the vectors originate,
even though the vector termini are fixed.
A natural answer here is to locate the vertex
on the best-fitting line. In this way the vector
configuration approaches unidimensionality.

An established principle of curve fitting in 
two dimensions is that the best-fitting straight 
line for a set of points shall pass through the 
centroid of those points. This principle is read- 
ily generalized to n-space. The vertex of the 
vector configuration can therefore be made a 
point on the best-fitting line by placing it at the 
centroid of the vector termini. It is not diffi- 
cult to show that an additional property of this 
position of the vertex is that it minimizes the 
sum of the squared vector lengths. 
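
The minimizing property just mentioned can be shown in one line. In the sketch below the symbols are introduced only for illustration: $x_i$ denotes the q vector termini, $\bar{x}$ their centroid, and $c$ any candidate position of the vertex.

```latex
\sum_{i=1}^{q} \lVert x_i - c \rVert^2
  = \sum_{i=1}^{q} \lVert x_i - \bar{x} \rVert^2
  + 2\,(\bar{x} - c)\cdot\sum_{i=1}^{q} (x_i - \bar{x})
  + q\,\lVert \bar{x} - c \rVert^2
  = \sum_{i=1}^{q} \lVert x_i - \bar{x} \rVert^2 + q\,\lVert \bar{x} - c \rVert^2 .
```

The middle term vanishes because deviations from the centroid sum to zero, so the sum of squared vector lengths is smallest when $c = \bar{x}$.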

With the vertex and the centroid identical, it 
is convenient for subsequent computations to 
translate the origin of the coordinate system in- 
to coincidence with the very same point. Then 
the best-fitting line will pass through the origin. 
It is only necessary to locate one more point on
that line in order to write its equation. Again 
it is convenient to borrow some concepts from 
factor analysis. If the configuration is fairly 
long and narrow, then most of the vectors will 
lie in only two octants of the n-space—the one 
in which the vector projections on all axes are 
positive, and the one in which all such projec- 
tions are negative. It is an easy matter to re- 
verse the direction of those vectors that have 
nothing but negative coordinates. Then the first
octant will contain all or nearly all of the vec- 
tors. The centroid of their termini can be tak- 
en as the second point on the best-fitting line. 

The data in Table I can be used to illustrate 
the computing routine. Table I contains the 
normal deviates corresponding to the percentile 
norms for ages 7-11 on the PMA Perception 
Test (ref. 3). Many entries at the upper left and
lower right in the table are left blank because 
of the instability of normal deviates for very 
large or very small percentile ranks. Normal 
deviates for percentile ranks above 97 or below 
3 are not recorded in Table I. 

A best-fitting line could be determined for 
all ten age groups simultaneously, but such a
procedure would use only the data within the 
dotted lines in Table I. A better approach is to 
block off the data into overlapping rectangular 
sections (blocks A, B, and C in Table I), so as 
to include as many of the usable entries as pos- 
sible. In this way most of the normal deviates 
can be used in the fitting. A line is fitted to 
each block separately. Then, in what amounts 
to a repeated application of the same concepts, 
these three results can be combined to give a 
single fitted line for the whole table of normal
deviates. 

Figure 1 shows the graph of columns 1 and 2
in Table I. The fact that this and the other plots 
are approximately linear indicates that an abso- 
lute scale can be fitted to the present data. Also 
shown in Figure 1 are the results of the fitting
procedure that is being used. The two parallel 
slanted lines indicate that the points between 
them are represented in one of the blocks. Here 
it is block A. The other constructions will be 
explained as they are made possible by the cal- 
culations. 

At the bottom of Table I are recorded the 
sums and means of the column entries within 
the three rectangular sections. The column 
means, $m_r$, for each block are the coordinates
of the centroid of the points represented in that 
block. 

In Figure 1 the dotted lines labeled $a_{i1}$ and
$a_{i2}$ have been made to intersect in the centroid
of the points represented in block A. The projections
of all points on axes translated to this
new position are obtained simply by subtracting
the mean of each column from every entry in
the column. That is,

$$a_{ir} = Z_{ir} - m_r. \tag{3}$$

The resulting $a_{ir}$ for block A are shown in Table
II. From that table it is apparent that, with one
exception, the projections of all points on the
new axes are either all positive or all negative.
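
The translation of equation (3) amounts to centering each column of a block on its mean. The sketch below uses a small hypothetical block of normal deviates (four points, three age groups), not the entries of Table I, and a hypothetical function name.

```python
def translate_to_centroid(z_block):
    """Return the column means m_r and the translated coordinates a_ir = Z_ir - m_r."""
    q = len(z_block)
    n = len(z_block[0])
    means = [sum(row[r] for row in z_block) / q for r in range(n)]
    translated = [[row[r] - means[r] for r in range(n)] for row in z_block]
    return means, translated


# Hypothetical normal deviates: rows are points, columns are age groups.
z_block = [
    [1.0, 0.6, 0.2],
    [0.5, 0.1, -0.3],
    [-0.5, -0.9, -1.3],
    [-1.0, -1.4, -1.8],
]
means, a = translate_to_centroid(z_block)

# After translation every column sums to zero, so the centroid is at the origin.
column_sums = [sum(row[r] for row in a) for r in range(3)]
```

In this made-up block every translated row is either all positive or all negative, which is the pattern the text describes for a long, narrow configuration.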

In this translation of axes it is convenient to 
carry along certain points not represented with- 
in a particular block, and to determine their po- 
sition in the new coordinate system. These are 
the points whose coordinates on some of the 
axes represented in the block are missing. Section 1
of Table III shows the $a_{ir}$ for such points
on the new axes of Table II. These coordinates 
are obtained by means of equation (3) from the 
corresponding entries in columns 1-5 of Table 
I. 

It is desirable to relate two other sets of
points to the same new axes. These are the
points having $Z_{ir}$ of zero and of 1, that is, the mean,
$M_r$, and the point $(M_r + \sigma_r)$ in the distribution
of ability for each age group. Imagine Table I
as being extended at the bottom so as to list the
$Z_{ir}$ for these points in the appropriate columns.
Then the coordinates of these points with respect
to the new axes can be obtained in exactly the
same way as were the entries in Table II and in
section 1 of Table III. The results are shown
in section 1 of Table IV. In symbols,

$$a_{Mr} = 0 - m_r = -m_r, \tag{4}$$

$$a_{\sigma r} = 1 - m_r. \tag{5}$$


At the bottom of Tables II, III, and IV are
shown summary checks for the $a_{ir}$ contained
in those tables. These checks are made possible
by summing equation (3) over the q values

TABLE I

NORMAL DEVIATES FOR PMA PERCEPTION TEST

Columns: Age Group, with Age Range 7-7.5, 7.5-8, 8-8.5, 8.5-9, 9-9.5,
9.5-10, 10-10.5, 10.5-11, 11-11.5, 11.5-12.
(The body of the table is not legible in the source. Surviving entries,
as read: 1.64, .88, .64, .48, 34, .08, .84, .75, 1.55, 1.41, 1.23, .95, .74, .52.)

Figure 1


Graph of Normal Deviates in Columns 1 and 2 of Table I 

TABLE II


TRANSLATED COORDINATES FOR SECTION A OF TABLE I 





Age Group 

TABLE V

ABSOLUTE SCALE VALUES FOR RAW SCORES

Raw Score   Scale Value      Raw Score   Scale Value

   43                           21          1.15
   41                           19           .84
   39                           17           .54
   37                           15           .21
   35                           13          -.12
   33                           11          -.43
   31                                       -.63
   29                                       -.85
   27                                      -1.04
   25                                      -1.24
   23                                      -1.88

(Blank cells are entries lost in the source.)


TABLE VI

ABSOLUTE SCALE VALUES FOR M AND σ

Columns: Age Range, M, σ.
(The entries of this table are not legible in the source.)

of $a_{ir}$ that are involved for any value of r. Thus

$$\sum^{q} a_{ir} = \sum^{q} Z_{ir} - q\,m_r. \tag{6}$$

For example, the sum of the entries in column
1 of section 1 in Table IV is .56, and by equation (6)
this should agree with the check value
(1 + 0 - 2 × .22) if the $a_{ir}$ have been correctly calculated.
The sum row in Table II is the sum of
rows (15), (Σ+), and (Σ-) in that table, the latter
two rows being used in subsequent calculations.

With all points referred to new axes whose
intersection defines one point on the best-fitting
line, the next task is to locate a second point on
that line. At the bottom of Table II are shown
the sums of the positive (Σ+) and of the negative
(Σ-) coordinates for block A. The point (15) is
excluded from the formation of these sums. The
absolute sums for these two rows appear in the
row marked (Σ||). These absolute sums are
proportional to the coordinates of the point that
is wanted. That point is the centroid of the termini
of all vectors that lie in the first octant
after the reflection of those vectors having all
negative coordinates. In the present example,
6 vectors already lie in the first octant, while 6
others can be put there by mere reversal of
their direction. Row (Σ||) gives the sums of the
coordinates of these 12 vector termini on each
of the coordinate axes. Division of these sums
by 12 would give the coordinates of the centroid
being sought. Just as they stand, however, these
sums constitute direction numbers for the best-fitting
line, and they have been so used to determine
its slope in Figure 1. They may be interpreted
as the coordinates of a vector collinear
with the centroid vector, but 12 times as long.
For convenience this long vector will be called
the line vector.
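
The formation of the direction numbers can be sketched as follows. The translated coordinates below are hypothetical, not those of Table II, and the function name is an assumption; the one mixed-sign row plays the part of the excluded point.

```python
def direction_numbers(a):
    """Sum all-positive and all-negative rows separately; return both sums and
    their absolute total, which serves as the line vector."""
    n = len(a[0])
    positive_sum = [0.0] * n
    negative_sum = [0.0] * n
    for row in a:
        if all(x > 0 for x in row):
            positive_sum = [s + x for s, x in zip(positive_sum, row)]
        elif all(x < 0 for x in row):
            negative_sum = [s + x for s, x in zip(negative_sum, row)]
        # Rows with mixed signs are left out of both sums.
    line_vector = [p - m for p, m in zip(positive_sum, negative_sum)]
    return positive_sum, negative_sum, line_vector


# Hypothetical translated coordinates (rows are points, columns are axes).
a = [
    [1.0, 0.8, 1.2],
    [0.4, 0.5, 0.3],
    [0.1, -0.1, 0.1],   # mixed signs: excluded
    [-0.5, -0.4, -0.6],
    [-1.0, -0.8, -1.0],
]
positive_sum, negative_sum, line_vector = direction_numbers(a)
```

Subtracting the negative sums is the same as reversing the all-negative vectors before adding them, so `line_vector` is the centroid of the reflected termini multiplied by the number of vectors used.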

The two ends of the line vector have been 
used to define the best-fitting line. The next 
task is to determine the projections of all points
on that line. Since these points represent vec- 
tor termini, it is convenient to construe this 
task as being that of computing the scalar prod- 
ucts of all vectors with the line vector. Such 
scalar products are proportional to the projec- 
tions of the vectors on the best-fitting line. The 
proportionality constant is the length of the line 
vector. Since the scale of measurement along 
the best-fitting line is essentially arbitrary, 
the scalar products themselves are sufficient 
for indicating relative spacings along that line. 

If the main body of Table II is called the matrix
A, column V is made into a column vector
V, and row (Σ||) is transposed into a column
vector B, then the matrix product,

$$AB = V, \tag{7}$$

gives the scalar products of all vectors in block
A with the line vector. In ordinary algebra,
the i-th scalar product is the i-th entry in column
V and is given by

$$v_i = a_{i1}b_1 + a_{i2}b_2 + \dots + a_{ir}b_r + \dots + a_{in}b_n, \tag{8}$$

where $b_r$ is the r-th entry in row (Σ||). Summing
over i in equation (8) gives

$$\sum_i v_i = b_1\sum_i a_{i1} + b_2\sum_i a_{i2} + \dots + b_r\sum_i a_{ir} + \dots + b_n\sum_i a_{in}. \tag{9}$$

This provides a summational check on the calculation
of column V. The quantity on the right
in equation (9) is shown in the (Ch.) position at
the bottom of column V. Its agreement with the
sum of column V constitutes the check.
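
Equations (7) to (9) can be sketched with a small hypothetical matrix A and line vector B; none of the numbers below come from Table II, and the function name is an assumption.

```python
def scalar_products(a, b):
    """Equation (8): v_i = sum over r of a_ir * b_r, for every row i of A."""
    return [sum(a_ir * b_r for a_ir, b_r in zip(row, b)) for row in a]


# Hypothetical translated coordinates A and direction numbers B.
a = [
    [1.0, 0.8, 1.2],
    [0.4, 0.5, 0.3],
    [0.1, -0.1, 0.1],
    [-0.5, -0.4, -0.6],
    [-1.0, -0.8, -1.0],
]
b = [2.9, 2.5, 3.1]

v = scalar_products(a, b)

# Equation (9): the sum of column V must equal the b_r-weighted column sums of A.
column_sums = [sum(row[r] for row in a) for r in range(len(b))]
check_value = sum(b_r * s for b_r, s in zip(b, column_sums))
```

The check compares `sum(v)` with `check_value`; the two agree apart from rounding whenever column V has been computed correctly.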

If point i lies on the best-fitting line and its
projection on any axis r is known, the scalar
product of its vector with the line vector can be
determined. Suppose that $b_r$ has to be multiplied
by a quantity $c_i$ in order to give $a_{ir}$. Then
for all axes, including r,

$$a_{i1} = c_i b_1,\quad a_{i2} = c_i b_2,\quad \dots,\quad a_{ir} = c_i b_r,\quad \dots,\quad a_{in} = c_i b_n. \tag{10}$$

From equations (8) and (10) the scalar product
of vector i with the line vector is

$$v_i = c_i b_1^2 + c_i b_2^2 + \dots + c_i b_n^2 = c_i \sum_{r=1}^{n} b_r^2. \tag{11}$$

The quantity $c_i$ is defined by equation (10) as

$$c_i = \frac{a_{ir}}{b_r}, \tag{12}$$

so that equation (11) becomes

$$v_i = \frac{a_{ir}}{b_r} \sum b^2. \tag{13}$$

Let

$$k_r = \frac{\sum b^2}{b_r}. \tag{14}$$

Then

$$v_i = k_r a_{ir}. \tag{15}$$

The quantity $k_r$ is constant for all values
The quantity $\sum b^2$ is shown at the right of row
(Σ||) in Table II. In row ($k_r$) at the bottom of
Table II are recorded the values of $k_r$. In section
2 of Tables III and IV are recorded the results
of multiplying the projections in section 1
by the appropriate $k_r$. At the bottom of section
2 in Tables III and IV are shown the usual summary
checks, based upon the summation of
equation (15) over the q values of $a_{ir}$ that are
involved for any value of r. Thus,

$$\sum^{q} v_i = k_r \sum^{q} a_{ir}. \tag{16}$$

The entries in section 2 of Tables III and IV
are estimates of the scalar products of the corresponding
vectors with the line vector, based
on the assumption that all of the points lie on
the best-fitting line. Where only one estimate
is available, it has been recopied into the column
marked V in Tables III and IV. When two
or more estimates are available for a given
point, their sum is recorded in the sum column
and their average is recorded in column V.
Summary checks are not possible here, but the
regular progression of the entries in column V,
especially as between Tables II and III, provides
an inspectional check, and of course each entry
in column V of Table III must be similar in magnitude
to the quantities of which it is the average.

The entries in column V of Tables II, III,
and IV are a fitted unidimensional representation
of the first five columns of Table I. Similar
calculations based on sections B and C of
Table I give two other fitted dimensions. Hence
in one computational cycle the original ten dimensions
are reduced to three. These three in
turn can be collapsed into one by a single repetition
of the same kind of calculations. Nothing
essentially new is involved in such calculations,
and so they will not be shown here. The final
absolute scale is shown in Tables V and VI. The
origin and unit of measurement have been adjusted
so that age group 1 has a mean of 0 and
a standard deviation of 1. The standard deviations
in Table VI were obtained from the formula,

$$\sigma_r = (M_r + \sigma_r) - M_r, \tag{17}$$

applied to the adjusted scale.
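
The final adjustment of origin and unit can be sketched as follows. The fitted values are hypothetical and in arbitrary units; the function name and the two-group example are assumptions made for illustration.

```python
def adjust_scale(mean_points, mean_plus_sigma_points):
    """Shift and rescale fitted values so that group 1 has mean 0 and sigma 1.

    mean_points[r] is the fitted value of the point M_r, and
    mean_plus_sigma_points[r] that of the point M_r + sigma_r.
    """
    origin = mean_points[0]
    unit = mean_plus_sigma_points[0] - mean_points[0]
    means = [(m - origin) / unit for m in mean_points]
    # Equation (17) on the adjusted scale: sigma_r = (M_r + sigma_r) - M_r.
    sigmas = [
        (ms - origin) / unit - (m - origin) / unit
        for m, ms in zip(mean_points, mean_plus_sigma_points)
    ]
    return means, sigmas


# Hypothetical fitted values for two age groups.
means, sigmas = adjust_scale([-2.0, 2.0], [6.0, 12.0])
```

Here the first group comes out with mean 0 and standard deviation 1 by construction, and the second with mean 0.5 and standard deviation 1.25.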

The computing procedures that have been de- 
scribed here are independent of the graphs and 
could be completed without them. It is never- 
theless a valuable addition to have some of the 
graphs (for all pairs of adjacent columns, for 
example) as visual checks. In the first place, 
the graphs indicate whether an absolute scale is 
possible at all. If deviations from linearity are 
serious and cannot be traced to plotting or cler- 
ical errors, then the computing process need 
never begin. In making the constructions it is 
a frequent experience to find something wrong 
with them on all of the graphs involving a particular
column of normal deviates. This usually
indicates that a computing error has occurred 
somewhere in that column. Furthermore, the 
nature of the graphical disparity usually suggests 
the kind of computing error that has been made. 
The incorrect translation of axes is indicated by 
the crossing of the new axes away from the middle
of the configuration, and errors in computing
the direction numbers show up in the tilt of
the fitted lines. Computing errors that are ac- 
tually very small can be detected from the graphs. 
Finally, the fit of the lines is a visual index 
of how well the theoretical scale of measure- 
ment is reflected in empirical fact. 



AN EXPERIMENTAL EVALUATION OF TWO
METHODS OF TEACHING MUSIC IN THE 
FOURTH AND FIFTH GRADES 


CARL B. NELSON 
University of Minnesota 


WHILE MANY judgments have been made 
by music educators as to the comparative value 
of instrumental music training versus vocal 
music training, very little systematic research 
has been done in classroom experimentation to 
obtain evidence of the effects of these judgments. 
The purpose of this experiment was to obtain 
objective data with respect to the relative worth 
of a course of study which utilized not only vocal
music training but also instrumental music in- 
struction. 

The stated aim of many music teachers and
supervisors has been ‘‘music for every child’’.
It is of significance to note that surveys conducted
in recent years reveal that the majority
of children in our schools receive training only
in vocal music. These surveys consistently
show that instrumental music education is reserved
for children who demonstrate outstanding
ability in music or for those who can afford
to purchase instruments and lessons. It also
is apparent that public school instrumental
training programs in a majority of cases are
sponsored so that bands or orchestras can be
developed, particularly at the high school level.
Thus, in the latter situation, this type of musical
training is not primarily adapted to meet
the needs of the individuals but rather pointed
toward the development of a performing group.

Although most children have some sort of 
vicarious instrumental music experience in pub- 
lic schools, it is generally not accepted that a 
background in instrumental performance is nec- 
essary for every child. It seems that music 
teachers universally assume that instrumental 
music education is more demanding on pupils 
so that those who show a greater talent for mu- 
sic are chosen for this training. 

Of course there are obvious reasons why such 
an attitude would be held frequently by music educators.
It is considered prohibitively expensive
to provide an opportunity for every child to
have both instrumental and vocal music instruc- 
tion. In addition, it would be necessary to ob- 
tain staff members to teach in both areas. The 
item of additional expense for the latter is not 
the only consideration; music educators also believe
that it would be difficult to recruit personnel
who would be qualified to handle this sort of
teaching. These are major items which appear
to stand in the way of a practical approach to a
combined vocal-instrumental music course of
study.

* Footnotes will be found at end of this article.

It was not the purpose of this study to pro- 
vide possible solutions to these problems. While 
the writer realizes that they are not easily over- 
come, the issue investigated is: does this type 
of program fill a need in our present elementary 
music course of study? It would seem logical to 
show first that such a system of instruction is 
worth the effort. Therefore, this study was un- 
dertaken to secure reliable objective information 
concerning the relative value of a music course 
including vocal and instrumental participation by 
the pupils as compared to one including only vo- 
cal instruction. 


The Population Sample 


The population selected for the study was the 
fourth and fifth grades of the Concord School, 
Edina, Minnesota. The Concord School is a six- 
year public elementary school with an enrollment 
of approximately 500 pupils. 

The sample was intact since all the fourth and 
fifth grade children of this school were included 
in the experiment. However, the writer found 
that this particular group of pupils could be con- 
sidered a representative sample of fourth and 
fifth grade children in the Edina elementary 
school system. Random samples of children who 
had been enrolled in the fourth and fifth grade two,
four, and six years previously were compared to
the present sample on the basis of intelligence (as
measured by the Otis Quick-Scoring Mental Ability
Test¹) and grade point average. The technique
of analysis of variance was applied to the
data. Conclusions based on the findings of this 
experiment thus must apply only to fourth and 
fifth grade children of this school system. 

In the previous spring the children had been 
grouped into four classes by the elementary 
school principal who was not at liberty to change
the grouping according to a preferred random assignment
in the fall. That is, the entire sample
already had been sorted into four distinct classes. 
The principal stated that no systematic order 
was used in the assignation of the children to
their respective groups. Therefore, interpretations
of the findings of this study must recognize
this particular manner of selection. Although
the precision of the experiment was not impaired 
by this circumstance, one cannot know for sure 
whether the error introduced by this selection is 
an over-estimate or an under-estimate. So that 
the four subclasses might be equally represented, 
the records of six children were deleted by random
selection. Thus there were, for purposes of
analysis, twenty-six children in each of the four
subgroups. The total N equaled 104.

A study of the occupations of the fathers of the 
subjects showed that the group could be regarded 
as a select sample in a socio-economic sense. 
The occupation of each parent was compared to 
the categories listed in the Minnesota Occupational
Scale². However, the tabulation showed
that no difference existed between the control 
and experimental groups with respect to that 
criterion. 

The subgroups were likewise canvassed to 
discover if any differences existed with regard 
to previous instrumental music training. A tab- 
ulation of the number of children who had played 
instruments prior to the time of the final testing 
and the number of months of this instruction re- 
vealed no differences between the subclasses. 

The tests used in this experiment to measure 
changes in musical behavior of the subjects are
an integral part of the experimental design. For 
this reason the writer calls attention to the se- 
lection and construction of these devices. 


Measuring Instruments Used 


A major problem facing the investigator was 
to obtain tests to measure achievement in music 
skills, specifically those connected with reading 
and understanding music notation. Existing music
achievement tests were considered unsuitable
for various reasons. The tests to be used
had to meet certain specifications in order that 
results obtained from them would merit justifi- 
able conclusions. 

The instruments used should not give biased 
results because of item content nor the manner 
of their construction. It would seem extremely 
important to use tests which would not give un- 
fair advantage to students trained in instrumen- 
tal music. Tests utilizing content at the fourth 
and fifth grade level primarily should also be a 
more valid test of the subjects’ learned skills. 

The author decided to construct an achieve- 
ment test of music reading skills to overcome 
difficulties inherent in those already on the mar- 
ket. The test was divided in two parts called 
Test 1 and Test 2. Test 1 was designed specif- 
ically to measure the pupils’ ability to recall 
notation and terminology used at the fourth and 
fifth grade level. It also measures the subjects’ 
ability to apply this knowledge to new situations.
Test 2, while it demands the same information
as Test 1, was constructed to measure the children’s
ability to match audio stimuli to visual
stimuli. 

Courses of study in music at the elementary 
level filed at the curriculum laboratory of the 
University of Minnesota were studied 
and materials and objectives of the music cur- 
riculum at the Edina elementary school were 
examined. These sources aided the author to 
determine the scope of the test. Preliminary 
forms of the two tests were drawn up and sub- 
mitted to three experts in the field of music ed- 
ucation, one of whom was the elementary music 
supervisor of the Edina system. It was through 
these steps that the investigator established the 
content validity of the instruments. 

Estimates of the reliability of both Test 1 and
Test 2 were made utilizing Hoyt’s technique by
means of the analysis of variance³. Reliability
coefficients on the final testing for Test 1 ranged
from .89 to .95 in the subgroups. For Test 2,
the coefficients ranged from .72 to .81. Both
tests were found to measure sufficiently accur- 
ately to differentiate among individuals. 
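
As an illustration of the kind of computation involved, the following is a minimal sketch of a Hoyt-type reliability estimate obtained from a pupils-by-items analysis of variance. The function name and the small score matrix are hypothetical and are not taken from the study; the coefficient is taken here as the pupils mean square less the residual mean square, divided by the pupils mean square.

```python
def hoyt_reliability(scores):
    """Reliability from a pupils-by-items table of item scores.

    scores: list of rows, one row per pupil, one column per item.
    Returns (MS_pupils - MS_residual) / MS_pupils.
    """
    n = len(scores)        # number of pupils
    k = len(scores[0])     # number of items
    grand = sum(sum(row) for row in scores) / (n * k)

    pupil_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares into pupils, items and residual.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pupils = k * sum((m - grand) ** 2 for m in pupil_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_pupils - ss_items

    ms_pupils = ss_pupils / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_pupils - ms_residual) / ms_pupils


# Hypothetical data: four pupils, three items scored 1 (right) or 0 (wrong).
example_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
example_reliability = hoyt_reliability(example_scores)
```

With right-wrong items an estimate of this form agrees with the familiar internal-consistency coefficients, which is one reason it is convenient when an analysis of variance is already being carried out.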

It would be of considerable value to discover 
if the effect of instrumental music instruction 
influenced the subjects’ preference for various 
categories of music. To test this, the author 
selected the Keston Music Preference Test.
This instrument was devised by Professor Morton
J. Keston of the Department of Psychology at the
University of New Mexico as a means of measur- 
ing musical preferences of senior high school 
students in a study he conducted in 1946-1947. 

The test is composed of thirty items; each 
item consists of excerpts from four orchestral 
compositions. Because each of the excerpts is 
approximately 45 seconds in length, all four fit 
on one side of a 12-inch disc. The groupings 
are arranged so that the same mood is expressed 
in each of the four excerpts. Due to this arrangement,
it is possible to say that one is not attempting
to measure the preferences of the subjects
with respect to mood. 

The subject is instructed to rank the four ex- 
cerpts of each item in order of his preference. 
The four categories are (a) severely classical, 
(b) serious popular classical, (c) easy popular 
classical, (d) popular. 

Keston established the validity of his test by 
administering it to twelve expert musicians and 
to students rated by their teachers as musical. 
In addition, he compared his categories of ex- 
cerpts with lists of the selections in phonograph 
record catalogues. 

Circumstances in the present study did not 
permit an estimate of the reliability of the test 
by means of a test-re-test, which would appear 
to be preferable. Keston established that, with 
his sample, the test was highly reliable. In the
present study, a split-halves test of reliability
showed coefficients ranging from .88 to .97 in
the subclasses (final testing). Keston selected a
portion of his sample to discover their consistency
in making preferences on this test. He presented
the four excerpts of each item in the six
possible paired comparisons. By means of the
‘‘circular triad’’⁵ criterion, he was able to demonstrate
a high degree of consistency in their
choices. 
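
The circular-triad idea can be sketched as follows. This is only an illustration under assumed inputs, not Keston's own procedure: the excerpt labels and the two sets of choices are hypothetical. A triad is circular when a subject prefers A to B, B to C, and C to A, so that no consistent ranking of the three exists.

```python
from itertools import combinations


def count_circular_triads(prefers, items):
    """Count inconsistent (circular) triples in a set of paired comparisons.

    prefers: set of (winner, loser) pairs; every pair of items must appear
             exactly once, in one direction or the other.
    items:   the labels being compared.
    """
    count = 0
    for triple in combinations(items, 3):
        wins = {label: 0 for label in triple}
        for x, y in combinations(triple, 2):
            winner = x if (x, y) in prefers else y
            wins[winner] += 1
        # A consistent triple has win counts 2, 1, 0; in a circular
        # triple each of the three items wins exactly once.
        if all(w == 1 for w in wins.values()):
            count += 1
    return count


excerpts = ["a", "b", "c", "d"]  # hypothetical labels for the four excerpts

# Six paired comparisons that agree with the ranking a, b, c, d.
consistent = {("a", "b"), ("a", "c"), ("a", "d"),
              ("b", "c"), ("b", "d"), ("c", "d")}

# The same choices except that c is preferred to a, which closes a circle.
inconsistent = {("a", "b"), ("c", "a"), ("a", "d"),
                ("b", "c"), ("b", "d"), ("c", "d")}

consistent_triads = count_circular_triads(consistent, excerpts)
inconsistent_triads = count_circular_triads(inconsistent, excerpts)
```

The fewer circular triads a subject produces over the six comparisons of an item, the more consistent his choices may be taken to be.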


Other Measures 


I.Q. - This measure was obtained from the
pupils’ individual school record folders; as men- 
tioned earlier, the Otis Test (Alpha) was the in- 
strument used. In most cases, the tests had been 
administered to the children when they were in 
the third grade. 

Grade point average - To obtain a measure of 
school achievement, the writer procured the office
record cards for each child and calculated
the grade point average for each. Only the letter 
grades earned during the year of this study were 
used. 

Reading comprehension - These scores were 
secured from the classroom teachers’ records 
immediately after the spring testing program. 
The reading comprehension test is a subtest
of the Iowa Every-Pupil Test of Basic Skills,
Elementary Battery, Form L⁶.











Procedure of the Study 


Both the control and experimental subgroups 
held music class daily for the entire school year. 
In order to minimize the effect of the instructor, 
the investigator taught all four subgroups from 
September 3, 1952, when school began 
in the fall, until May 22, 1953, the close of the 
school year. Because of the physical environ- 
ment in which the experiment was staged it was 
not possible to differentiate the control and experimental
treatments until November 17. Thus
the period under investigation extended actually 
for twenty-five weeks. The class period in all 
groups was one-half hour long. 

The control subgroups followed the course of 
study in general use throughout the elementary 
system. The great majority of time was used in 
singing with a good deal of emphasis placed on
reading musical notation. The solfeggio system 
and the movable ‘‘do’’ were used in sight singing 
almost exclusively. 

The only difference between the experimental 
and control groups’ course of study was the in- 
troduction of instrumental instruction. During 
the 25-week period, exactly half of the class 
time devoted to music was spent in this manner. 
The division of time for the experimental treatment
was arranged so that alternate days were
spent for instrumental and vocal music training. 
The unit of organization in both types of instruc- 
tion was the entire class. 

Materials for the singing classes in both the 
experimental and control groups were furnished 
by the school. The American Singer series⁷
and the World of Music series⁸ supplied the
author with the primary source of song material.
Nearly all of the materials for the instrumental
program were furnished by the writer. Instruments
were rented for the children’s use from
a local music store. The investigator also furnished
the children racks, reeds, instrument repair
material and method books. The method
book primarily used was The Boosey and Hawkes
Band Method⁹.





The Design of the Experiment 


In order to obtain as much information as pos- 
sible in the natural setting of the public school, 
the author decided to use a 2 × 2 factorial design.
Not only would it be of value to study the effect of 
the experimental treatment, but also the effect of 
the treatment at different levels. Thus, by es- 
tablishing control and experimental groups at 
both the fourth and fifth grade levels, the study 
yielded more pertinent information. The investigator
felt that the efficiency of the experiment
was correspondingly increased and the results 
more widely applicable. 

The assignment of the instructional method 
to the two groups at each level was decided ran- 
domly. Each class was numbered either ‘‘one’’ 
or ‘‘two’’. By drawing a number from a table 
of random numbers, it was decided whether the 
even- or odd-numbered class should receive the 
experimental treatment. 

The measuring instruments were applied at 
the beginning of the experiment and again at the 
end to determine what differences might have 
been effected by the introduction of instrumental 
instruction. The analysis of variance and co- 
variance provided the means of testing the sig- 
nificance of the differences. 


Analysis of the Data 


The analysis of variance and covariance was 
a useful test in this particular experiment be- 
cause it revealed the significance of a number of 
components simultaneously. Furthermore, this 
tool increased the precision of the analysis by 
making adjustments in the outcomes which may 
have been due to the inequalities in initial abil- 
ities or achievement. 

Choice of independent variates - Data on six 
variables had been collected, namely, the pre- 
tests of the two music achievement tests and the 
Keston Test, I.Q., grade point average and reading
comprehension. It seemed desirable to determine
whether or not intellectual capacity,
achievement and reading comprehension showed 
significant relationships with the criteria. Clear- 
ly, it would be undesirable to assume that such 
factors were unrelated to the criteria without 
first investigating the relationships. 

All the correlations between each of the independent
variates (reading, school achievement,
I.Q. and the three pretests) and the correlations
between them and the dependent variates (the criterion
measures) were computed. Thus, for each
of the dependent variates, Test 1 final, Test 2
final and the final test of the Keston, six systems
of simultaneous equations were solved for the
four subgroups. The method used in this study
to solve for the unknown quantities was the well-known
Doolittle method¹⁰.

The standard partial regression coefficient 
(beta) for each variate was then computed for each 
subgroup on each of the dependent variates. To 
test the significance of the beta’s it was necessary
to obtain the ratio value of the beta to its
standard error in each case. Since these values 
were distributed as ‘‘t’’, the significance of the 
beta’s was determined by referring to the ‘‘t’’ 
table with nineteen degrees of freedom. 
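
The nineteen degrees of freedom are consistent with twenty-six pupils per subgroup and six independent variates (26 - 6 - 1 = 19). A minimal sketch of the test follows; the beta and standard error shown are hypothetical, and the use of a two-tailed one percent point is an assumption made for the illustration.

```python
def residual_df(n_cases, n_predictors):
    """Degrees of freedom left after fitting n_predictors and a constant."""
    return n_cases - n_predictors - 1


def beta_t_ratio(beta, standard_error):
    """Ratio of a standard partial regression coefficient to its standard error."""
    return beta / standard_error


# Twenty-six pupils per subgroup and six independent variates, as in the text.
df = residual_df(26, 6)

# Two-tailed 1 percent point of t for 19 d.f., from a standard t table.
T_CRITICAL_01 = 2.861

# Hypothetical beta and standard error, for illustration only.
t_ratio = beta_t_ratio(0.60, 0.15)
is_significant = abs(t_ratio) > T_CRITICAL_01
```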

The standard partial regression coefficients 
of the pretests of the three dependent variates 
were the only significant contributors to the 
criteria. No significant relationship with a criterion
variable was found for intelligence, achievement,
or reading comprehension in any subgroup
at the one percent level. 

The next step in the analysis consisted of the 
application of the analysis of variance and covariance
to test the hypothesis that there was no difference
between the experimental and control
groups on each of the three criterion variables. 
Since the preceding analysis revealed that the 
pretests were significantly related to the criteria,
only these variables were controlled by covariance.


The assumption of equal variance on all three
of the tests for the four subgroups was met by
applying the L₁ criterion¹¹. The method used to
test the assumption of a normal distribution on
each of the measures was the probit test¹². It
was found that there were no significant deviations
from that distribution. Table I presents
the means and standard deviations on the pretests 
and final tests of the four subclasses for all three 
variables. 

Knowledge of musical notation - Table II gives 
the analysis of variance and covariance table, 
with respect to Test 1 data, for interaction between
method and grade. Since the experimental
treatment was introduced at both the fourth
and fifth grade levels, this interaction effect must 
be tested first. Note that there is a significant 
interaction. Due to this effect, the variation at- 
tributed to the interaction was not pooled with 
the residual error in testing the null hypothesis
that there is no significant difference in the 
means. 

The adjustment of the sums of squares to test
the hypothesis of no difference between the means
of the experimental and control groups on Test 1
is indicated in Table III. The tabled value of F at
the one percent level when n₁ = 1 and n₂ = 99 is
6.90. The value of F obtained for ‘‘between
methods’’ in this case is 27.38. Thus the exper- 
imental classes achieved significantly higher 
than the controls with respect to skills measured 
by Test 1. 

Because it was shown that a significant inter- 
action existed between the main effects of treat- 
ment and level, it was of interest to discover 
whether or not there was a significant difference 
in adjusted means for individual pairs of subclasses.
A method of testing this is described
by Lindquist¹³ and Mood¹⁴.

The criterion means for each of the groups
were adjusted by the following formula:

$$\bar{Y}'_i = \bar{Y}_i - b_w(\bar{X}_i - M_x)$$

where

$\bar{Y}'_i$ (i = 1, 2) is the adjusted criterion mean
for the ith group.

$\bar{Y}_i$ (i = 1, 2) is the criterion mean for the
ith group.

$b_w$ is the ‘‘within subclasses’’ regression
coefficient.

$\bar{X}_i$ (i = 1, 2) is the mean of the control variable
for the ith group.

$M_x$ is the grand mean of the control variable.

The error variance of the difference between
the two adjusted criterion means is given by

$$MS'_w\left[\frac{1}{n_1} + \frac{1}{n_2} + \frac{(\bar{X}_1 - \bar{X}_2)^2}{\sum x_w^2}\right]$$

where

$MS'_w$ is the adjusted mean square for ‘‘within
subclasses’’.

$\sum x_w^2$ is the sum of squares for ‘‘within subclasses’’
for the control variable.

$n_1$ is the number of individuals in the first
group.

$n_2$ is the number of individuals in the second
group.

The value of ‘‘t’’, therefore, is equal to

$$t = \frac{\bar{Y}'_1 - \bar{Y}'_2}{\sqrt{MS'_w\left[\dfrac{1}{n_1} + \dfrac{1}{n_2} + \dfrac{(\bar{X}_1 - \bar{X}_2)^2}{\sum x_w^2}\right]}}$$
TABLE I

MEANS AND STANDARD DEVIATIONS ON THE
CRITERION VARIABLES BY SUBGROUPS

                                  Pretest             Final
Subgroup                       Mean     S.D.      Mean     S.D.

Test 1
Fourth grade control           23.69     8.86     26.15     9.87
Fourth grade experimental      20.73    10.58     27.27    12.05
Fifth grade control            23.83     8.96     22.96     9.12
Fifth grade experimental       24.34     9.67     34.50     8.44

Test 2
Fourth grade control           17.46    12.62     24.27    11.26
Fourth grade experimental      19.69     9.97     24.81    10.37
Fifth grade control            19.23     9.29     20.12    12.97
Fifth grade experimental       19.12    12.21     26.85    11.45

Keston Test *
Fourth grade control           99.04    24.77    104.50    27.70
Fourth grade experimental     110.77    17.99    103.54    24.77
Fifth grade control           108.04    21.80    116.15    30.74
Fifth grade experimental       95.15    26.72     89.15    29.93

* ‘‘Poor’’ performance on this test is indicated by a high score.
with degrees of freedom km - k - 1, where k
equals the number of treatments or levels, and
m is the number of observations in each cell.
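
The comparison of two adjusted means described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the regression coefficient, adjusted mean square and within-subclasses sum of squares used in the example are invented values, since the article does not report them.

```python
import math


def adjusted_mean(y_mean, x_mean, grand_x_mean, b_w):
    """Criterion mean adjusted for the group's standing on the control variable."""
    return y_mean - b_w * (x_mean - grand_x_mean)


def t_for_adjusted_difference(y1, x1, n1, y2, x2, n2,
                              grand_x_mean, b_w, ms_w_adjusted, ss_x_within):
    """t for the difference between two adjusted criterion means."""
    difference = (adjusted_mean(y1, x1, grand_x_mean, b_w)
                  - adjusted_mean(y2, x2, grand_x_mean, b_w))
    error_variance = ms_w_adjusted * (
        1 / n1 + 1 / n2 + (x1 - x2) ** 2 / ss_x_within
    )
    return difference / math.sqrt(error_variance)


def comparison_df(k, m):
    """Degrees of freedom km - k - 1 for k subclasses of m observations each."""
    return k * m - k - 1


# Four subclasses of twenty-six pupils, as in the present study.
df = comparison_df(4, 26)

# Invented values chosen only to show the arithmetic.
t_value = t_for_adjusted_difference(
    y1=10.0, x1=5.0, n1=8,
    y2=8.0, x2=3.0, n2=8,
    grand_x_mean=4.0, b_w=0.5,
    ms_w_adjusted=4.0, ss_x_within=100.0,
)
```

Note that the grand mean of the control variable cancels when two adjusted means are subtracted, so the difference depends only on the two criterion means, the two control means and the regression coefficient.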

Since the difference between any two of the 
four subclasses represents just one of six possible
comparisons, the probability level required
for significance is not set at 1 in 100 but
1/6 in 100, or .0017. A situation of this type
is described by Johnson¹⁵. These tests were
run on each of the six possible comparisons.
The experimental group proved to be significantly
different from the control group at the
fifth grade level, but not at the fourth grade.
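
The arithmetic behind the .0017 level can be set out briefly. The sketch below simply counts the pairs that can be formed from four subclasses and divides the one percent level among them; the variable names are illustrative.

```python
from math import comb

n_subclasses = 4
overall_level = 0.01

# Number of distinct pairs that can be drawn from four subclasses.
n_comparisons = comb(n_subclasses, 2)

# Level required of each single comparison.
per_comparison_level = overall_level / n_comparisons
rounded_level = round(per_comparison_level, 4)
```

Dividing the level in this way keeps the chance of at least one false declaration of significance among the six comparisons at or below about one in one hundred.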

Matching of audio and visual stimuli - The 
test for interaction on data from Test 2 yielded 
an F of 5.0 so that the null hypothesis of no in- 
teraction between treatment and level remained 
in doubt (.05 > P > .01). However, the variation
due to the interaction effect was not pooled with
the residual error to test the hypothesis of no
difference in means.

The analysis further revealed there was no 
significant difference between the means of the 
experimental and the control treatments with re- 
spect to the skills measured by this test (F = 
2.83). Due to the fact that it was not reasonable
to assume there was no interaction between
the main effects of treatment and level, it was 
advisable to proceed in the same manner as be- 
fore to test between individual subgroups. 

No significant differences were discovered 
when the subclasses were compared with each
other. It should be noted that the value of ‘‘t’’
was found to be 2.76 when the experimental and
control groups at the fifth grade level were tested.
The hypothesis of no difference between
their adjusted means must, therefore, remain
in doubt (.01 > P > .0017).

Music preference - The analysis of variance and covariance
showed that it was possible to assume that
the main effects of treatment and level with respect
to the scores on the Keston Test were independent
of each other. Since the variation due
to interaction was not significant, it was possible
to pool it with the variation already identified
as residual.

The analysis further showed that the test dis- 
criminated significantly between the experimen- 
tal and control classes. F was computed to be 
13.62 (P <.01). Thus, it was concluded that, 
within the limitations of the experiment, the com- 
bined approach of vocal and instrumental music 
training was superior to vocal alone when music
preference as measured by this test was used as
a criterion. The analysis showed this to be true 
at the fourth and fifth grade levels, so it was not
necessary to proceed beyond this point, as was
true in the other two instances.


Other Data 


In order to reveal trends in the attitudes of 
the children’s parents as to the value of combined 
vocal and instrumental training, a questionnaire 
canvass was made. Each set of parents of the 
children in both experimental and control subclasses
was requested to answer a twelve-item question- 
naire when the experiment was nearly completed. 
A one hundred percent response was obtained 
since the questionnaires were taken home and returned
by the children.

The response totals for each item were tabu- 
lated and then tested by the chi-square technique 
and significance accepted at the five percent level. 
Not all the items discriminated between the ‘‘experimental
parents’’ and the ‘‘control parents’’
and there was some variation between the two 
levels. However, it was apparent that the par- 
ents of both fourth and fifth grade pupils accepted 
the program. 

More of the parents of the children in the experimental
groups felt their children had been
stimulated to continue the study of their music 
than those of the controls. Another significant 
finding was that the experimental parents felt 
that the introduction of instrumental music to the 
curriculum would not tend to place too much em- 
phasis on music in relation to the rest of the 
school program. The control parents’ responses 
were significantly different at the five percent 
level on both of these points. Thus, in general,
it appeared safe to assume that the parents of the 
children who had participated in the experimen- 
tal course of study favored it. 

Toward the end of the school term, the children
in each of the four subclasses were given an
alphabetized list of all the subjects included in
their curriculum. The classroom teacher in every
case administered the questionnaire and the children
had no indication that the main object of the
device was to discover the ranking given to music.
The instruction given to the pupils by the class- 
room teacher was that the children rank their 
subjects in the order in which they liked them. 

To be sure, the information thus received 
was not an absolute index of their preference for 
music. It was, rather, a relative measure because
the individuals were forced to compare music
with the other subjects in the curriculum.
However, this method seemed preferable to the 
more commonly-used rating scale device, par- 
ticularly at these grade levels. The chi-square 
test was used for determining the significance 
of the difference in the distribution of ranks 
assigned. A significant difference in favor of 
the experimental class was found at the fifth 
grade level (P <.001) but not at the fourth. 
Implications


The fifth grade experimental class attained 
significantly better scores on Test 1 and the
Keston Test than the fifth grade control group,
and the null hypothesis remained in doubt on
Test 2. The pupils in this class also ranked mu- 
sic significantly higher on the subject preference 
list than the other fifth grade. The fourth grade 
experimental subjects did not do as well, although 
Table I shows the same trend in both the fourth 
and fifth grades. 

These findings would seem to suggest that it 
might not be advisable to introduce a combined 
vocal-instrumental curriculum until the fifth 
grade. At least, the criterion of achievement in 
reading and understanding music notation indicates
such a conclusion. On the other hand, analysis
of music preference scores showed that
the experimental method produced effective results
in both fourth and fifth grade experimental
classes. Thus it is possible that fourth grade
children are learning skills and attitudes not
measured in this experiment. Further study and
the use of other criteria would help supervisors 
to make such decisions if a program of this type 
were to be adopted. 

The analysis of the data collected strongly 
indicates that the integration of both vocal and 
instrumental instruction would tend to develop 
within the pupils a broader base of musical 
comprehension than can be obtained through vocal 
training alone. Responses to questionnaire devices
lead one to believe that such a program is
more interesting for children and that parents 
would be satisfied with the arrangement. 

However, it would be of great interest to 
know whether or not such a course of study would 
have lasting effects. For instance, would it be 
advisable to extend this instruction over a longer 
period of time? It may be that the desired outcomes
of a music education are of such a complex
nature that a one-year combined vocal-instrumental
program might not be sufficient.
Again, the effects of this dual instruction may 
not be revealed until later years so that the find- 
ings of this study do not give a true picture of the 
changes in the subjects. 

To answer some of these questions, the writer 
has collected data on all of the measures on the 
same pupils used in this study one school year 
after the experiment ended. Analysis of these 
data will be of value because the results will be
more widely applicable for the music educator. 
The findings of this follow-up study will be pub- 
lished in the near future. 


A COMPARISON OF VERBAL AND PICTORIAL
SELF-RATING-SCALE CATEGORIES 


JULIAN C. STANLEY* 
University of Wisconsin 


DO VERBAL reference points on a self-rating
scale result in different marking from
pictorial ones? 

Chester E. Evans¹** has devised two forms
of a ‘‘personal satisfaction’’ inventory which
are virtually identical except that the five categories
of response on one are words (very satisfied,
satisfied, neutral, dissatisfied, very dissatisfied),
while on the other they are five smiling,
neutral, or scowling facial expressions of
a man. Sample items from the inventory are:
the city in which you live, your last boss, our
foreign policy.²


Design of the Study 





In order to ascertain whether or not the two 
sets of reference point anchors have different 
effects, the writer administered inventories 
twice to each of 48 college students of both sexes
in two required courses, as outlined in Table I. 
In the ‘‘child psychology’’ (CP) class, 24 indi- 
viduals were assigned at random to alternate 
seats in four rows. All six persons in the first 
row filled out the ‘‘words’’ inventory, followed 
after a 15-minute period of irrelevant activity 
by the ‘‘faces’’ inventory. Those in the third 
row (even-numbered rows were left vacant) re- 
ceived ‘‘faces’’ and then ‘‘words.’’ The fifth 
row had ‘‘words’’ followed by ‘‘words,’’ and the
seventh ‘‘faces’’ followed by ‘‘faces.’’ Procedure
was similar in the ‘‘educational psychology’’
(EP) class, except that because of shorter rows, 
assignments were made to alternate seats in 
eight rows. Rows 1 and 5 received inventories 
in words-faces order, 2 and 6 faces-words, 3 
and 7 words-words, and 4 and 8 faces-faces. 
More persons than are shown in Table I were 
tested, but enough were eliminated randomly to 
leave six in each class-design-order group for 
the statistical analysis. 

Few students realized that any order other 
than the one they experienced was being used. 
The second inventories were not mentioned until 
15 minutes after the first ones were turned in, 
during which interval regular class activities 
went on. The EP class ended at 11:50 a.m. and 
the other began at 1:20 p.m. the same day; apparently,
no information concerning the experiment
leaked out during this 90 minutes.

Satisfaction inventory scores, the dependent 
variable, consisted of mean item scores expressed
on a 0 (very dissatisfied or heavy scowl) to
4 (very satisfied or broad smile) scale, multiplied
by 100 to eliminate decimals. Since a few
of the 50 items did not apply to many of these stu- 
dents (e.g., ‘‘your husband or wife’’), the divisor 
was usually slightly less than 50. The 96 three- 
digit scores ranged from 178 (100 = dissatisfied, 
200 = neutral) to 328 (300 = satisfied), with a 
grand mean of 247. 
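
The scoring rule just described may be sketched as follows. The function name and the response lists are hypothetical; items that do not apply are represented here by None and dropped from the divisor, and rounding to the nearest whole number is an assumption about how the three-digit scores were obtained.

```python
def inventory_score(responses):
    """Mean item score on the 0-4 scale, multiplied by 100.

    responses: one entry per item, an integer from 0 (very dissatisfied)
               to 4 (very satisfied), or None if the item does not apply.
    """
    answered = [r for r in responses if r is not None]
    return round(100 * sum(answered) / len(answered))


# Hypothetical five-item examples; the actual inventory had 50 items.
score_a = inventory_score([4, 3, None, 2, 1])   # four applicable items
score_b = inventory_score([3, 3, 2, None])      # three applicable items
```

On this scale a score of 200 corresponds to an average response of ‘‘neutral’’ and 300 to an average response of ‘‘satisfied.’’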


Four Ways to Analyze the Scores 





The Table I scores may be analyzed in at least
10 partially different ways, but we need be concerned
with only four of these.³ Our chief interest
is in the ‘‘anchor’’ (‘‘words’’ vs. ‘‘faces’’)
main effect. This can be tested in three different
ways: by considering only scores from the
first inventory, as in Table II; by treating the
two non-crossover designs (1) together, as in
Table III, where ‘‘anchor’’ is a between-individuals
main effect; and by treating the two crossover
designs (2,3,4,5,6,7) together, as in Table
IV, where ‘‘anchor’’ is a within-individuals main
effect.

The first of these, in Table II, is the most 
powerful, having 44 d.f. for the ‘‘error’’ mean 
square, and it also provides a statistical test
for the ‘‘class’’ main effect and the ‘‘class X 
anchor’’ interaction. 

The pooled non-crossover design analyzed in
Table III tests the same three sources of variation
as the Table II procedure but with just 20
d.f. for ‘‘error’’; in addition, it tests the ‘‘sequence’’
main effect and the three interactions
of ‘‘sequence’’ with ‘‘class’’ and ‘‘anchor.’’

The Table IV replicated latin-square design 
involves administering both inventories to each 
student, unlike the other two methods, so that 
scores for ‘‘words’’ are correlated with those
for ‘‘faces.’’

Thus with respect to ‘‘anchor’’ differences,
Tables II, III, and IV answer three questions:

1. If two random groups are tested, one with 


# The writer is indebted to Professor Palmer O. Johnson for several helpful suggestions concerning
the reporting of this study.


References and footnotes will be found at end of this article. 
TABLE I
OVERALL DESIGN FOR COMPARING ‘‘WORDS’’ AND ‘‘FACES’’ ANCHORS

Class                             Design          Order                        Individuals   First Inventory   Second Inventory
‘‘Child Psychology’’ (CP)         Crossover       Words, Faces (Words first)   1 to 6        Words             Faces
‘‘Child Psychology’’ (CP)         Crossover       Faces, Words (Faces first)   7 to 12       Faces             Words
‘‘Child Psychology’’ (CP)         Non-Crossover   Words, Words (Words first)   13 to 18      Words             Words
‘‘Child Psychology’’ (CP)         Non-Crossover   Faces, Faces (Faces first)   19 to 24      Faces             Faces
‘‘Educational Psychology’’ (EP)   Crossover       Words, Faces (Words first)   25 to 30      Words             Faces
‘‘Educational Psychology’’ (EP)   Crossover       Faces, Words (Faces first)   31 to 36      Faces             Words
‘‘Educational Psychology’’ (EP)   Non-Crossover   Words, Words (Words first)   37 to 42      Words             Words
‘‘Educational Psychology’’ (EP)   Non-Crossover   Faces, Faces (Faces first)   43 to 48      Faces             Faces
TABLE II

ANALYSIS OF VARIANCE FOR THE 48 INVENTORIES ADMINISTERED FIRST

Source of Variation             Sum of Squares   d.f.   Mean Square   F
Between classes                       4,408.33      1      4,408.33   5.16*
Between anchors                          33.33      1         33.33
Class x anchor                          456.33      1        456.33
Within class-anchor groups**         37,597.00     44        854.48

Total                                42,495.00     47

* Significant beyond the .05 level.

**Bartlett’s test yields a corrected chi square of 12.71 with 3 d.f., which is significant beyond
the .01 level, for the four class-anchor-group variance estimates whose sums of squares are
pooled to yield the error term. The estimates are 486, 305, 512, and 2114.


TABLE III

ANALYSIS OF VARIANCE OF THE TWO NON-CROSSOVER DESIGNS TOGETHER

Source of Variation                                    d.f.   Mean Square   F*
Class                                                     1      6,302.08     3.33
Anchor                                                    1      2,214.08     1.17
Class x anchor                                            1      1,776.33
Individuals within class-anchor groups                   20      1,894.98   118.50**
Sequence                                                  1          2.08
Class x anchor x sequence                                 1         36.75     2.30
Class x sequence                                          1          0
Anchor x sequence                                         1          0.33
Individual x sequence within class-anchor groups         20         15.99

Total                                                    47

* Computed from mean squares carried to four decimal places.

**Significant beyond the .001 level.
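As a rough check on the F column of the pooled non-crossover analysis, each between-individuals effect is divided by the ‘‘individuals within class-anchor groups’’ mean square, and each within-individuals effect by the ‘‘individual x sequence’’ mean square. The variable names below are illustrative only, and the published mean squares are rounded to two decimals, so the last digit may differ slightly from the printed F values.

```python
def f_ratio(effect_mean_square, error_mean_square):
    """F is the effect mean square divided by its appropriate error mean square."""
    return effect_mean_square / error_mean_square


between_error = 1894.98  # individuals within class-anchor groups, 20 d.f.
within_error = 15.99     # individual x sequence within groups, 20 d.f.

f_class = f_ratio(6302.08, between_error)
f_anchor = f_ratio(2214.08, between_error)
f_individuals = f_ratio(between_error, within_error)
f_class_anchor_sequence = f_ratio(36.75, within_error)

for label, value in [
    ("class", f_class),
    ("anchor", f_anchor),
    ("individuals", f_individuals),
    ("class x anchor x sequence", f_class_anchor_sequence),
]:
    print(f"{label}: F = {value:.2f}")
```

The very large F for individuals reflects the high correlation between the first and second administrations.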
TABLE IV
ANALYSIS OF VARIANCE OF THE TWO CROSSOVER DESIGNS TOGETHER

Source of Variation                                   d.f.   Mean Square   F

Class                                                           3,605.33
Order                                                           2,106.75
Class x order                                                         33
Individuals within class-order groups
Sequence
Order x sequence (anchor)
Class x sequence
Class x anchor
Individual x sequence within class-order groups

Total

*Significant beyond the .001 level.
TABLE V

OVERALL ANALYSIS OF VARIANCE OF SCORES IN TABLE I DESIGN

Source of Variation                                   d.f.   Mean Square   F

Between individuals
  Class
  Design
  Order (W first vs. F first)
  Class x design x order
  Class x design
  Class x order
  Design x order
  Individuals within CDO groups

‘‘Within’’ individuals
  Sequence
  Class x design x order x sequence
  Class x design x sequence
  Class x order x sequence                                                  4.11***
  Design x order x sequence                                                 1.34
  Class x sequence
  Design x sequence
  Order x sequence
  Individual x sequence within CDO groups

Total

* Significant beyond the .025 level.
** Significant beyond the .001 level.
***Significant at the .05 level.
TABLE VI

VARIANCE ESTIMATES THAT WERE POOLED TO YIELD THE TWO
ERROR TERMS IN TABLE V

                                                  Mean Square
Class-Design-Order Group          d.f.   Between Individuals     I x S
CP, Crossover, WF                   5               981.53       50.93
CP, Crossover, FW                   5               329.35      107.88
CP, Non-C, WW                       5             1,074.28       18.95
CP, Non-C, FF                       5               576.80        4.93
EP, Crossover, WF                   5             1,308.33       21.93
EP, Crossover, FW                   5             3,975.48        9.88
EP, Non-C, WW                       5               624.13       12.60
EP, Non-C, FF                       5             5,304.68       27.48

Corrected chi square (Bartlett)                      15.27       15.38
Chi square at .05 level (7 d.f.)                     14.07       14.07
‘‘words’’ and the other with ‘‘faces,’’ will their
mean scores differ significantly?

2. If two random groups are tested twice
each, one with ‘‘words’’ both times and the other
with ‘‘faces,’’ will the means of their two summed
scores differ significantly?

3. If two random groups are tested, one with
‘‘words’’ followed by ‘‘faces’’ and the other with
‘‘faces’’ followed by ‘‘words,’’ will the mean for
‘‘words’’ differ significantly from that for ‘‘faces’’?

Usually we are not trying to answer the sec- 
ond question. The first and third questions are 
essentially equivalent, so it would seem most 
suitable with a fixed number of subjects to em- 
ploy only the crossover design for all members 
in each class, thus reducing the ‘‘error’’ term 
for the anchor main effect by taking into account 
the high correlation between ‘‘words’’ and 
‘‘faces,’’ besides providing as many degrees of
freedom for ‘‘error’’ as the Table II analysis. 
The crossover design requires retesting each 
subject and a considerable degree of control, 
but it yields correlational information not avail- 
able when each person is tested only once. Al- 
so, it will often be more efficient, both statis- 
tically and financially, to secure 96 scores from 
48 examinees than 96 scores from 96 examinees. 

However, the Table I mixture of crossover
and non-crossover designs permits us to test differences
among the eight r’s and thereby to see
whether the ‘‘unalike anchor’’ reliability coefficients
differ significantly from the ‘‘test-retest’’
coefficients. They should not if the two forms
are to be regarded as interchangeable.

The Table I design cannot be analyzed as a
whole without losing the ‘‘anchor’’ classification
by dealing with ‘‘order’’ (‘‘words inventory
taken first’’ vs. ‘‘faces inventory taken first’’),
as shown in Table V.


Results 


‘‘Words’’ and ‘‘faces’’ anchors do not differ
significantly from each other in any of the analyses
(Tables II, III, and IV), nor does ‘‘anchor’’
interact significantly with class or sequence. In
Tables II and V the classes differ significantly 
beyond the .05 level, and in Table V the inter- 
action of class, order, and sequence appears to 
be significant. However, because of the large 
number of significance tests run (15 main ones 
in Table V) and the heterogeneous ‘‘error’’ var- 
iance shown in Table VI, this apparently signif- 
icant C x O x S mean square may be attributable 
to chance; if real, its meaning is not clear. 

The eight Pearsonian r’s, each based upon
six pairs of scores, range from .55 for the
CP-crossover-faces-first group to .9961 for the
EP-crossover-faces-first group. The Fisher-z
equivalents of these r’s have an estimated population
variance of 0.6182, compared with the
theoretical variance of 0.3333 (that is, 1/(n - 3)
with n = 6) for random-sample z’s based upon six
pairs. The ratio of 0.6182
to 0.3333 is a non-significant F of 1.85 with 7
and infinite degrees of freedom.4 Thus the eight
r’s do not differ significantly from each other.
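The homogeneity check on the eight correlations can be retraced numerically. The sketch below starts from the reported variance of the z values (the eight individual r’s are not all given in the text, so they are not recomputed here); the function names are hypothetical. It relies on two standard facts: the sampling variance of Fisher’s z is 1/(n - 3), and an F with infinite denominator degrees of freedom, multiplied by its numerator degrees of freedom, is distributed as chi square.

```python
import math


def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)


def z_variance_ratio(observed_variance, n_pairs):
    """Ratio of the observed variance of z's to the theoretical 1 / (n - 3)."""
    theoretical = 1.0 / (n_pairs - 3)
    return observed_variance / theoretical, theoretical


f, theoretical = z_variance_ratio(0.6182, n_pairs=6)
chi_square = 7 * f            # (k - 1) * F for k = 8 coefficients
critical_05 = 14.07           # chi square at the .05 level with 7 d.f. (Table VI)

print(round(theoretical, 4), round(f, 2), round(chi_square, 2))
print("significant at .05:", chi_square > critical_05)
```

The equivalent chi square falls below the .05 critical value, which agrees with the conclusion that the eight r’s do not differ significantly.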


Discussion 


The two chief findings from the analyses outlined
above are high correlation between the inventories
and somewhat greater personal satisfaction
in the EP class than in CP. Perhaps the
CP class members, chiefly first-semester juniors
enrolled for their initial course in the School
of Education, were under more tension than the
first-semester seniors in the EP class, who had
already been self-screened for malcontents. Several
alternative hypotheses are equally plausible,
of course.

‘‘Words’’ and ‘‘faces’’ means do not differ
significantly for the college students employed
as subjects in this study.5 Scores on the two inventories
have similar variability and appear to
intercorrelate highly. Obviously, it is possible
that other verbal categories would produce reactions
different from the faces, since the word
‘‘very’’ may be quite indeterminate. It should
be interesting to compare various rating-scale
markers, such as the numbers +2 through -2 or
pips on a line, with the word and face anchors.

Although the satisfaction inventories origin- 
ated in industrial situations, results of this study 
should not be over-generalized. There may be 
marked differences between the reactions of ad- 
vanced college students and industrial workers. 
Some of the designs discussed here should be
suitable for experimentation with various non-college
groups.


Summary 


Using with 48 college students an experimental
design involving both crossover and non-crossover
principles, the writer shows by four
analyses of variance that two ‘‘personal satisfaction’’
inventories devised by Chester E. Evans,
which have words as rating-scale reference points
in one and faces in the other, yield equivalent
mean scores, differentiate between two college
classes, and intercorrelate approximately .96.
FOOTNOTES

1. Employe Research Section, General Motors
Corporation, Detroit 2, Michigan.

2. Twenty-seven of the 50 items were originated
by Joseph Weitz (8). Two other sources of
information concerning the history and development
of the satisfaction inventories are Technical
Bulletins No. 10 and 16, Employe Research
Section, General Motors Corporation: Theodore
Kunin, ‘‘A Study in the Placement of Faces Along
Subjective Continua,’’ June, 1950, and ‘‘The
Weitz-Evans Community Checklist,’’ January,
1954. Kunin’s report, an M.A. thesis at Western
Reserve University, is scheduled for condensed
publication in a forthcoming issue of Personnel
Psychology. The present writer is indebted
to Dr. Evans for furnishing the inventories
and calling his attention to their antecedents.

3. The composite design of Table I involves
two crossover and two non-crossover designs,
each of which four might be subjected to an analysis
of variance separately, and two classes,
which might also be analyzed apart from each
other. The writer has done these six analyses
for his own information and will be glad to correspond
with interested persons concerning them,
but they are not discussed in this article.

4. This procedure is equivalent to the chi-square
test outlined by Edwards (3:135).

5. Since no population was defined and random
samples taken from it, conclusions must be limited
to these 48 subjects or to some hypothetical
population for which they might be conceived to
constitute a random sample.








THE INTERPRETATION OF THE STANDARD 
ERROR OF MEASUREMENT 


C. H. PATTERSON 
Veterans Administration Regional Office 
St. Paul, Minnesota 


THE USE of the standard error of meas- 
urement as a measure of the reliability of test 
scores has been recommended by some test ex- 
perts (18:222). Since the standard error of 
measurement applies to a single score, and is 
stated in terms of test score units, it is much 
more useful in counseling individuals than the 
reliability coefficient. Although Kelley pointed 
out in 1921 (13) that the standard error of meas- 
urement is relatively independent of the range 
of scores (or ability), a factor which has considerable
effect on coefficients of reliability,
this apparently is true only under certain conditions
(17, 20, 21).

Recently the writer, in discussing the relia- 
bility of the Wechsler-Bellevue Intelligence 
Scales (23), had occasion to refer to the stand- 
ard error of measurement of an IQ obtained 
with this instrument. Reference was made to 
several standard statistical texts. It was dis- 
covered that many texts make no reference to 
the standard error of measurement (3, 25, 26, 
28,30), while others give no definition or inter- 
pretation of it. It is interesting that Cronbach’s 
recent book on psychological testing (1) makes 
no mention of this measure of reliability. In 
the recent volume on educational measurement, 
edited by E. F. Lindquist, R. L. Thorndike, in 
his excellent chapter on reliability (27), refers 
to this statistic, but does not discuss its mean- 
ing or use. 

In view of the lack of information and the ap- 
parent disagreement of interpretation, the writ- 
er felt that other test users and counselors 
would be interested in a discussion of the mean- 
ing and interpretation of this measure of relia- 
bility. The controversy between Guilford and 
Garrett (5, 6, 9) apparently has not resolved the
issues. Although it appeared that Garrett 
agreed fundamentally with Guilford, he persist- 
ed in maintaining his disagreement, and in as- 
serting the equivalence of two different inter- 
pretations. 

The standard error of measurement, variously
designated by different writers (for example, as
$\sigma_e$ or $\sigma_{1\infty}$), is given as

$$\sigma_e = s_1\sqrt{1 - r_{11}} \qquad (1)$$

where $s_1$ is the standard deviation of the distribution
of test scores, and $r_{11}$ is the reliability
coefficient of the test, preferably based on the
correlation of alternate (equivalent) forms.
This formula gives the standard error of an actual
observed score; the standard error of measurement
is the standard deviation of the distribution
of the scores an individual would make
on a large number of testings with equivalent
forms. The mean of this distribution of obtained
scores would be the individual’s ‘‘true score’’
on the test.
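A small numerical sketch may make the formula concrete. The standard deviation of 15 and reliability of .91 are illustrative values chosen for this example, not figures taken from any of the texts cited, and the function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def band(center, standard_error, multiple=1.0):
    """Interval of +/- multiple standard errors around a center value."""
    half_width = multiple * standard_error
    return (center - half_width, center + half_width)


# Illustrative values only: an IQ-type scale with sd = 15 and reliability .91.
sem = standard_error_of_measurement(15.0, 0.91)
print(round(sem, 2))

# For a given true score of 50 and a standard error of 4, about two-thirds
# of the obtained scores are expected to fall in this band.
print(band(50.0, 4.0))
```

Note that the band is centered on the true score and describes the spread of obtained scores, which is the direction in which the formula is defined.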

A number of the various interpretations of 
this measure will be quoted and discussed. 

1. One type of interpretation consists of state- 
ments concerning the true score of individuals 
obtaining a specified observed score. These 
statements are in the form of the probability of 
the true score falling within specified limits of 
the obtained score. It is stated, usually, that 
the true score of an individual making a specified
score is almost certain to lie within three
standard errors of the obtained score, or, that 
the chances are approximately 68 out of 100 that 
the true score lies within + one standard error 
of the obtained score. Peatman, for example, 
referring to the confidence limits of an IQ of 109, 
given a standard error of 4.5, states: ‘‘... thus 
it is likely that persons with an IQ score of 109 
have parameter (i.e., true) scores whose values
lie between 100.0 and 118.0; and we can be
quite confident that the parameter values will 
not be less than 95.5 or greater than 122.5’’ 
(24:379). While not an exact probability state- 
ment, this appears to be of the form stated 
above. 

2. Guilford (8:414) points out that such state- 
ments imply the prediction of the true score 
from the observed score. This would require 
the standard error of estimate of a true score. 
In another place (10:480) he states: ‘‘Any obtained
score does not tell us what the corresponding
true score is, but with knowledge of
$\sigma_{1\infty}$ we can have a degree of confidence that
the true score cannot be very far away.’’ This
implies that no exact statement concerning the
true score can be made from an observed score.
As Lindquist points out (18:221 footnote), the
standard error of measurement is the standard
error of estimating an observed score on a test
from the corresponding true score. Thus the 
proper interpretation of the standard error of 
measurement is in terms of the distribution of 
obtained scores for a given true score. Guilford
(8:414) gives as the correct type of statement:
‘‘If a certain individual has a true score
of 50 points in the test, then we may expect two-thirds
of all his actual scores to lie between
46 and 54’’ (given a standard error of measurement
of 4). Lindquist agrees
with this interpretation (18:221-222).

Variations of this type of interpretation are 
in terms of the probabilities of any observed 
score falling within certain limits of the true 
score, or the probabilities of the obtained score
differing from the true score by a given amount,
say, one or more standard errors, or probable
errors. For example, Kelley (15:79) writes:
‘‘....the chances are fifty in one hundred that
the pupil’s obtained score differs from his true
ability score by an amount greater than [one
PE].’’ Garrett also follows this latter interpretation.
He says (using a PE of 3): ‘‘This
result may be interpreted to mean that the
chances are even (50 in 100) that the obtained
score of any individual in the group ... does not
differ from his true score by more than ± 3
points’’ (7:321).

It will be noted that these interpretations dif- 
fer from those of Guilford and Lindquist in that 
they make statements concerning an individual 
score. That is, they assign a probability to an 
individual score, rather than to a distribution 
of scores. The error of this procedure will be 
discussed later. 

Gulliksen also follows the interpretation in 
terms of obtained scores, in the manner of 
Guilford and Lindquist. He states that, ‘‘.... 
in general, no probability statements can be 
made to apply to all persons who make a given 
observed score. We can only make the statement
the other way around. For all persons
with a given true score, the probability is over
.997 that the observed score will lie within plus
or minus three times the standard error of measurement
from that true score’’ (11:20).

3. However, another interpretation appears 
possible. While it may not be permissible to 
make a probability statement regarding all per- 
sons with a given observed score, it is possible 
to make statements regarding a particular indi- 
vidual with a given observed score. There ap- 
pear to be three methods of doing this. The 
first one to be discussed is the method used by 
Kelley and others, including McNemar (24) and 
Garrett (7:320), in which the observed score is 
taken as an estimate of the true score. While 
Kelley defines the standard error of measurement
as ‘‘the standard deviation of single scores
for a given fixed value of the true score,’’ this
is equivalent, for him, to ‘‘the standard deviation
of errors of estimate when the single score
is taken as evidence of a true score’’ (15:171).
On this basis, Kelley makes interpretations
such as the following (given a score of 70 and
a standard error of 7.72): ‘‘There is one chance
in six that John’s true ability lies below 62.3,
so that there are four chances in six that it lies
between 62.3 and 77.7’’ (15:172). In another
place Kelley writes: ‘‘$\sigma_{1\infty}$ may correctly be
used to indicate the standard error when $x_1$ (an
obtained score) is estimated from a knowledge
of $x_\infty$ (a true score) (a process actually never
performed), or when $x_\infty$ is estimated by using
non-regressed $x_1$’’ (16:403).

These statements concerning the true score 
are similar to that of Peatman quoted before. 
However, Peatman does not specify that the ob- 
served score is taken as evidence of the true 
score. There is a difficulty in the nature of the 
statements made which will be referred to later. 
The correct form of statement regarding a true
score on the basis of the standard error of an
observed score which is taken as evidence of a
true score is of the type that for a given observed
score, the corresponding true score will
fall within a given interval, determined by the
observed score and a multiple of its standard
error according to the confidence coefficient desired,
and we shall be right in the statement that
the interval will cover the true score in a specified
proportion of the cases corresponding to
the confidence coefficient selected. It should
also be stated that the confidence interval limits 
are random variables. 

4. There is a second way in which statements
about true scores may be made. While it is true
that the observed score is an estimate of the true
score, it is not the true score, nor the best estimate
of the true score. Using an observed
score as the estimate of the true score and applying
the standard error to the observed score
as if it were the error of the true score (as Kelley
does above) may be confusing. It may be
argued that the best estimate of the true score 
is one obtained as a result of prediction based 
upon the regression of the true scores upon the 
observed score, rather than accepting the ob- 
served score as the estimate of the true score. 

The standard error of estimate in predicting
a true score from an observed score by means
of a regression equation is given by

$$\sigma_{\infty\cdot 1} = s_1\sqrt{1 - r_{11}}\,\sqrt{r_{11}}$$

$$\sigma_{\infty\cdot 1} = s_1\sqrt{r_{11}(1 - r_{11})} \qquad (2)$$

where $s_1$ is the standard deviation of the observed
scores and $r_{11}$ is the coefficient of reliability
of the observed scores. Since the reliability
coefficient is always less than one, it
is clear that the standard error of estimate will 
always be less than the standard error of meas- 
urement. That is, the error of a regressed 
score taken as an estimate of true score is al- 
ways less than the error of an obtained score 
taken as an estimate of a true score. 
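The comparison between the two standard errors can be illustrated numerically. The regression estimate used below, the group mean plus the reliability times the deviation of the observed score from that mean, is the usual regressed-score formula; it is not written out in the text above, and the values (mean 100, standard deviation 15, reliability .91, observed score 109) are assumptions made for the example.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """sd * sqrt(1 - reliability): error of an obtained score."""
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_estimate(sd, reliability):
    """sd * sqrt(reliability * (1 - reliability)): error of a regressed score."""
    return sd * math.sqrt(reliability * (1.0 - reliability))


def regressed_true_score(observed, group_mean, reliability):
    """Estimated true score: the observed score regressed toward the group mean."""
    return group_mean + reliability * (observed - group_mean)


sd, reliability = 15.0, 0.91   # illustrative values, not from the source
sem = standard_error_of_measurement(sd, reliability)
see = standard_error_of_estimate(sd, reliability)
estimate = regressed_true_score(109.0, 100.0, reliability)

print(round(sem, 3), round(see, 3), round(estimate, 2))
```

The ratio of the two errors is the square root of the reliability, so with a reliable test the gain from regressing the score is small, which is the point made by the writers quoted in this section.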
The usual interpretation of the error of estimate
is in terms of the confidence interval including
the true score. Thus it may be stated
that (for additional samples) in 99 percent of the
cases the true score falls within the interval
three standard errors above and below the estimated
true score.

The standard error of estimate of a regressed
score is seldom used as a measure of reliability.
Guilford writes: ‘‘There is nothing to
be gained by [predicting true scores] for the
predictions would be no more accurate than the
scores from which they were obtained’’ (10:479).
Kelley, however, states that ‘‘to secure an estimate
of true ability, [this] process is in all
cases the best, but if reliability is high, it is
not sufficiently better....to warrant the extra
labor’’ (15:181). Gulliksen agrees that ‘‘no
practical advantage is gained from using the regression
equation to estimate true scores’’ (11:45).

5. No exact probability statement can be
made concerning the group of individuals with a
given observed score. However, reasonable
limits may be set for the true score of an individual
with a given observed score. Gulliksen
describes the procedure as follows: ‘‘We may
also indicate a method for assigning a probability
value to the statement: ‘If a person’s observed
score is 65, his true score probably lies
between 50 and 80 (± 3$\sigma_{1\infty}$).’ Note that no probability
value is given. We cannot say that for
all persons whose observed score is 65, the
probability is greater than .99 that the true
score is between 50 and 80. However, consider
the statement: ‘This person’s true score lies between
50 and 80.’ If the person’s true score
were known, the statement could, for a given
person, be classified as true or false.... For
each of the persons whose observed score is
known, such a statement can be made.... For
all persons in the distribution, it will be found
that the statement regarding limits is true over
99 percent of the time and is false less than
three times out of a thousand. In other words,
if all the cases were considered, a probability
can be attached to the truth or falsity of the statement
that ‘true score is included within the specified
limits.’ However, if we limit the statement
to persons with any given observed score,
no assertion regarding probability can be made.
...We say then that for a person whose observed
score is 65, reasonable limits for his true score
are 50 to 80’’ (11:17-20).

Statements such as that of Peatman, quoted 
earlier, may be technically correct, since they 
use terms such as ‘‘likely’’ and ‘‘quite confident.’’
However, they are misleading in that
they imply exactness to the reader. All that 
can be done is to set ‘‘reasonable limits’’ for 
the true score, without assigning exact proba- 
bilities. 

6. A method of assigning exact probabilities 
rather than ‘‘reasonable limits’’ to statements 
referring to a group of individuals with a given
obtained score would appear to be possible
through the use of tolerance limits.1 Tolerance
limits determine an interval which will include 
at least a specified proportion, P, of the distri- 
bution, with a specified confidence. 

Tolerance limits are of the form

$$\bar{X} \pm Ks \qquad (3)$$

where $\bar{X}$ and $s$ are the mean and the standard
deviation of the sample, and $K$ is one-half of the
tolerance interval in units of $s$, which may be determined by
reference to a table in which values of $K$ are given
for specified values of $P$, of the confidence
coefficient or level desired, and of $N$ (4:102-107;
also reproduced in 2, Table 16 in the Appendix).
In the present application $\bar{X}$ would be the obtained
score (taken as an estimate of the mean of the
distribution of obtained scores in repeated testing,
i.e., of the true score), and $s$ is the standard
error of measurement (taken as an estimate
of the standard deviation of the distribution of
obtained scores, i.e., of an estimate of the standard
error of the obtained score taken as an estimate
of the true score). In repeated testing on
equivalent tests, the group of individuals with a
given obtained score on the first test would obtain
varying scores on the other tests, and the
standard error of measurement would also vary
from test to test. Tolerance limits enable one
to state, with a specified degree of confidence,
the interval which will include a specified proportion
P of the obtained scores.

1. The possibility of applying tolerance limits in this situation was suggested by Dr. Palmer O. Johnson,
University of Minnesota.

As an example of the procedure, let us take
an obtained score of 50, with a standard error
of 4. It is desired that tolerance limits be set
which will include 95 percent of the obtained
scores in future samples, or on equivalent tests,
with a confidence of .99. The N would be the
number of cases entering into the determination
of the standard error of measurement, say, 100
in this example. K would in this case be 2.355.
The interval would then be $50 \pm K\sigma_{1\infty}$, or 40.58
to 59.42. We can thus be sure, at the .99 confidence
level, that in repeated testing of this
sample of 100 cases (or in other samples of the
same size), 95 percent of the obtained scores
(or estimates of the true scores) of the group
with an original obtained score of 50 will fall
within this interval. That is, in 99 percent of
the equivalent tests, 95 percent of the obtained
scores will be included within the interval (or,
still differently, 99 percent of the tolerance
limits determined from repeated tests will include
95 percent of the scores). If we desired
to include 99 percent of the scores in 95 percent
of the tests, the limits would be 38.26 to 61.74.
By increasing P we can, with a corresponding
increase in the size of the interval, include a
greater proportion of the scores.
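The arithmetic of the first example is simple enough to sketch directly. The factor K = 2.355 is the tabled value quoted in the text for N = 100, P = .95, and a confidence of .99; the function below (a hypothetical helper) only applies that factor and does not attempt to compute K itself, which would require the published tables or a statistical library.

```python
def tolerance_limits(obtained_score, standard_error, k):
    """Limits of the form obtained_score +/- k * standard_error, to two decimals."""
    half_width = k * standard_error
    return (
        round(obtained_score - half_width, 2),
        round(obtained_score + half_width, 2),
    )


# Obtained score 50, standard error 4, tabled K = 2.355 (from the text).
limits = tolerance_limits(50.0, 4.0, 2.355)
print(limits)
```

A larger tabled K, as required when a greater proportion P of the scores is to be covered, widens the interval symmetrically about the obtained score.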

Finally, it is necessary to consider the nature
of probability statements regarding confidence
intervals within which a given parameter,
such as a true score, lies. The true score is a
fixed value, not a random variable. (In Fisher’s
concept of fiducial limits, the parameter is the
same as in the case of confidence intervals.)
Since it is fixed, it either is or is not within
the confidence interval in a particular case, and
the probability is either unity or zero, respectively.
The probability therefore applies to the
statement, which will be true in the proportion
of cases specified by the confidence coefficient,
and not to the individual score (12:111; 22:221-222;
29:258-261).2

It would appear to follow that any statement
concerning an individual score must take the
form given above. Thus, the correct statement
concerning a given observed score would be of
the type: ‘‘This individual’s observed score lies
within plus and minus one standard error of his
true score, and we shall be correct in this statement
in 68 cases out of 100,’’ or (taking the obtained
score as an estimate of the true score),
‘‘This individual’s true score lies within plus or minus
one standard error of his observed score,
and we shall be correct in this statement in 68
out of 100 cases.’’
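The frequency meaning of ‘‘correct in 68 cases out of 100’’ can be illustrated by simulation. The sketch below assumes normally distributed errors of measurement, a fixed true score of 50, and a standard error of 4; these are assumptions of the illustration, and the function name is hypothetical. The true score never changes; what varies from trial to trial is the obtained score and therefore the interval built around it.

```python
import random


def coverage_rate(true_score, standard_error, multiple, trials, seed=0):
    """Proportion of trials in which obtained +/- multiple * SE covers the true score."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each obtained score is the fixed true score plus a random error.
        obtained = rng.gauss(true_score, standard_error)
        low = obtained - multiple * standard_error
        high = obtained + multiple * standard_error
        if low <= true_score <= high:
            hits += 1
    return hits / trials


rate_one_se = coverage_rate(50.0, 4.0, 1.0, trials=20000)
rate_three_se = coverage_rate(50.0, 4.0, 3.0, trials=20000)
print(rate_one_se, rate_three_se)
```

In any single trial the interval either does or does not contain the true score; the proportion of roughly .68 belongs to the long run of such statements, not to the individual score.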

Accepting this frequency theory of probabil- 
ity, it appears possible, however, to make a 
probability statement concerning a group of in- 
dividuals, or a group of scores, when the nature 
of the statement is appropriate. (1) As an example,
‘‘for the group of persons whose true
score is 50, over 99 percent of the observed
scores will be between three standard errors
above and below the true score’’ (11:17). (2) In
the case of true scores estimated by means of 
the regression formula, the correct statement 
would be that for all cases with the same esti- 
mated true score, the actual true score falls 
within three standard errors of the estimated 
true score over 99 percent of the time. (3) For 
a group of individuals with a given obtained 
score, in repeated testing the obtained scores 
will almost certainly fall within three standard
errors (‘‘reasonable limits’’) of the corresponding
true score. (4) For a group of individuals
with a given observed score, in repeated
testing a specified proportion of the obtained
scores will fall within given tolerance limits 
with a specified probability, i.e., in a specified 
proportion of the samples of equivalent tests. 


Summary 


A survey of a number of standard texts in 
statistics indicated some disagreement in the in- 
terpretation of the standard error of measure- 
ment. Representative interpretations were quot- 
ed and classified under several types. Two main 
types of difficulty appear in interpreting the 
standard error of measurement. The first concerns
the problem of making statements concerning
true scores on the basis of the standard
error of observed scores. Guilford and Lindquist
prefer to make interpretations only in terms
of the distribution of obtained scores in repeated
testing, or of the distribution of the observed 
scores of individuals with a given (assumed) true 
score. On the other hand, various writers make 
statements concerning the true score on the bas- 
is of an observed score and its standard error. 
They do this by implicitly or explicitly taking 
the observed score as the estimate of the true 
score, and using the standard error of the ob- 
served score as being equivalent to the standard 
error of a true score. While the observed score
is an estimate of the true score, it is not the
true score, nor is it necessarily the best esti- 
mate of the true score. It would appear that this 
type of interpretation is therefore questionable. 
Statements concerning the true scores of individ- 
uals with a given observed score should be based 
on the regressed score and its standard error 
of estimate. While the effort involved in this 
procedure of estimating true scores may not be 
justified, statements concerning the true scores 
associated with observed scores should be lim- 
ited to such regressed scores. Rigorous logic 
would indicate, therefore, that statements 
should be of the type given by Guilford and Lind- 
quist. Nevertheless, use of the observed score 
as the estimate of the true score is commonly 
accepted, and appropriate statements concerning
the true scores of individuals with a given
observed score taken as evidence of the true 
score are given. 

A type of interpretation which makes state- 
ments about the group of individuals with a giv- 
en observed score was mentioned. No simple 
exact statement can be made about all the indi- 
viduals with a given observed score. A type of 
exact probability statement based on the appli- 
cation of tolerance limits has been suggested. 


This problem is of little practical interest,
however, since we are usually not interested in
making statements about groups of individuals
with a given observed score.

2. Another way of stating it would be to say that the confidence interval computed from repeated sample
estimates will cover the true score in the proportion of cases specified by the confidence coefficient.

The second main difficulty concerns the type 
of statement which can be made. Probability 
applies to the statement, which is true in a given
proportion of cases, and not to the score or
parameter, which is fixed, i.e., it either is or
is not within the stated limits in a particular
case. It cannot be stated, therefore, that the 
probability is two-thirds that a given individual’s 
true score lies within plus or minus one stand- 
ard error of his observed score taken as an es- 
timate of his true score. For all the individuals 
in the distribution, the probability is two-thirds 
that their true scores lie within plus or minus 
one standard error of their observed scores, i.e.,
it can be stated that ‘‘this individual’s true 
score lies within plus or minus one standard 
error of his observed score’’, and we shall be 
correct in two-thirds of the cases in the dis- 
tribution. 

This type of statement led to a consideration
of the nature of probability statements which
are appropriate in other interpretations. It is
suggested that, following the frequency theory
of probability, statements concerning an individual
score must be of this type, the probability
applying to the truth or falsity of the statement,
not to the score. For groups of individuals, or
for the distribution of scores of an individual,
probability statements can be made in the usual
fashion, and the types of appropriate statements
are suggested. The use of tolerance limits
makes it possible to make statements concerning
the interval which will include a specified
proportion of the scores on repeated tests of a
group of individuals with a given observed score,
to which an exact probability can be assigned.

SIMULTANEOUS EXAMINATION AND METHOD
ANALYSIS BY VARIANCE ALGEBRA


WILLIAM J. MOONAN* 
U.S. Naval Personnel Research Field Activity 
San Diego, California 


Introduction 


CONSIDERABLE attention is often given
to the construction of examinations which are to
be utilized in some respect for educational and
psychological research purposes. Frequently,
these tests are used in method experiments of
varying complexity. It is a matter of record
that very little attention is given to the
examination after or during the execution of the
experiments. Frequently consistency coefficients
are quoted from data obtained from
extra-experimental sources, but it is more
important to report these when they result from
the imposition of the methods upon the
experimental subjects. This is especially true if
the methods impose conditions which are somewhat
different from those of the original examination
standardization.

Another important matter concerns whether
or not any item or set of items ‘‘interact,’’ in the
non-additive sense, with the different methods.
In order to point out the relevancy to methods
experiments, consider this hypothetical
educational example. Suppose a 50 item test is
given to two classes of equal size of arithmetic
students who were taught by two different methods.
If the students of the first class answered the
first 25 items correctly and missed the last 25
items, while the students of the second class
reversed this type of response, then the two sample
means which are calculated from the total test
scores of the two groups would be equal and the
hypothesis of equality of the two methods would be
sustained.

Certainly we would not like to conclude that 
the methods of teaching were equivalent since we 
lack the necessary basis for judging the equiva- 
lence with a test whose items so obviously inter- 
act with the methods. This may happen on a 
smaller scale in real educational experiments 
and consequently should be guarded against. We 
need to use an examination with items which are 
affected by, but do not interact with, the factors 
of the experiment. Tests of hypotheses for item- 
method interactions and consistency coefficients 
would be of great value in the analysis of exper- 
iments utilizing examination data. 
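
The hypothetical 50-item example can be reproduced numerically. The sketch below is only an illustration: the class size of 30 is an assumption (the text says only that the classes are equal in size), and every name in it is hypothetical. It shows that the two mean total scores coincide while the item-by-item differences between the classes are as large as they can be.

```python
# Hypothetical illustration of the 50-item example; class size is assumed.
N_ITEMS = 50
CLASS_SIZE = 30  # assumption: the text only says the classes are of equal size


def make_class(first_half_correct, n_students=CLASS_SIZE):
    """Build 0/1 item responses for a class in which every student
    answers one half of the test correctly and misses the other half."""
    half = N_ITEMS // 2
    if first_half_correct:
        pattern = [1] * half + [0] * half
    else:
        pattern = [0] * half + [1] * half
    return [list(pattern) for _ in range(n_students)]


def mean_total_score(responses):
    """Mean of the students' total test scores."""
    return sum(sum(student) for student in responses) / len(responses)


def item_means(responses):
    """Proportion of students answering each item correctly."""
    n = len(responses)
    return [sum(student[i] for student in responses) / n for i in range(N_ITEMS)]


class_1 = make_class(first_half_correct=True)
class_2 = make_class(first_half_correct=False)

total_1 = mean_total_score(class_1)
total_2 = mean_total_score(class_2)

# Item-level contrast between the two methods: +1 on the first 25 items,
# -1 on the last 25, although the totals are identical.
item_differences = [a - b for a, b in zip(item_means(class_1), item_means(class_2))]

print(total_1, total_2)
print(item_differences[0], item_differences[-1])
```

A comparison of total scores alone cannot see the item-level pattern, which is the reason a separate test of the item-method interaction is wanted.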





It is the purpose of this paper to show how 
item-factor interactions may be tested in a 1-fac- 
tor experiment where the variables are obtained 
as a result of an examination score and to evalu- 
ate the internal consistency coefficient and the 
index of internal consistency of the examination. 
The factor will be referred to as ‘‘methods’’ in 
the discussion which follows. These ideas may 
be easily generalized to other designs typically 
used in educational and psychological research. 


The Analysis of Examination Data in a Methods 
Design 





It is immediately recognizable that the item 
responses on the examination are not stochasti- 
cally independent for a particular subject who 
participates in either method. We shall make 
the assumption that this dependency is measured 
by $\rho(I)$, which is Fisher’s intra-class correlation
coefficient. Thus we may write the expected value
of the $I \sum_{m=1}^{M} S(m) = IS$ normal variables as


(1) $E\,y(i,m,s) = \Theta(i,m) = \xi + \zeta(i) + \mu(m) + \theta(i,m)$,


where $i = 1, \ldots, I$; $m = 1, \ldots, M$; $s = 1, \ldots, S(m)$.


In this equation, $\xi$ represents the general effect;
$\zeta(i)$, the effect of the ith item; $\mu(m)$, the effect
of the mth method; and $\theta(i,m)$, the effect of the
(i, m)th interaction. The model may be
reparameterized by the following linear restrictions:


$\sum_{i=1}^{I} \zeta(i) = 0$; $\sum_{m=1}^{M} \mu(m) = 0$; $\sum_{i=1}^{I} \theta(i,m) = \sum_{m=1}^{M} \theta(i,m) = 0$.


The variance of any observation, y(i,m,s), is
$V[y(i,m,s)] = \sigma^2$. The following table indicates
this fact as well as provides the covariances, C,
for various combinations of i, m, and s. The
primes denote different values of the indices, ex- 


*The opinions expressed are solely those of the author and are in no way official; nor are they to be 
construed as representing those of the U. S. Naval Personnel Research Unit or Bureau of Personnel. 








254 JOURNAL OF EXPERIMENTAL EDUCATION 


cept where equalities are indicated. Notice that 
$C[y(i,m,s), y(i',m,s)] = \rho(I)\sigma^2$ and that the
variance, $\sigma^2$, and the covariance, $\rho(I)\sigma^2$,
are assumed to be equal for all methods.














                  i = i'                      i ≠ i'
            s = s'      s ≠ s'        s = s'              s ≠ s'
m = m'      $\sigma^2$          0         $\rho(I)\sigma^2$          0
m ≠ m'         0           0             0                   0

















We now follow a procedure suggested by Nandi (3).
An orthogonal linear transformation is
made on the y(i,m,s). The equations of
transformation are:


(2) $z(h,m,s) = \left[ h\,y(h+1,m,s) - \sum_{i=1}^{h} y(i,m,s) \right] / \sqrt{h(h+1)}$;  $h = 1, \ldots, I-1$

(3) $z(I,m,s) = \sqrt{I}\, y(.,m,s) = \sum_{i=1}^{I} y(i,m,s) / \sqrt{I}$.


Hence, the variables z(1,m,s), z(2,m,s),
..., z(I-1,m,s) are normally and stochastically
independent. The variance of the variables is
$\sigma^2[1 - \rho(I)]$ since


(4) $V[z(h,m,s)] = \left\{ h^2 V[y(h+1,m,s)] + V\left[\sum_{i=1}^{h} y(i,m,s)\right] - 2h\,C\left[y(h+1,m,s), \sum_{i=1}^{h} y(i,m,s)\right] \right\} / h(h+1)$
$= \left\{ [h^2\sigma^2] + [h\sigma^2 + h(h-1)\rho(I)\sigma^2] - 2h[h\rho(I)\sigma^2] \right\} / h(h+1) = \sigma^2[1 - \rho(I)]$,

and $C[z(h,m,s), z(h',m,s)] = 0$.
Also it is easy to show that $V[z(I,m,s)] = \sigma^2[1 + (I-1)\rho(I)]$
and $C[z(I,m,s), z(h,m,s)] = 0$. Clearly,
$E\,z(I,m,s) = \sqrt{I}[\xi + \mu(m)]$ and is
functionally independent of $\zeta(i)$ and $\theta(i,m)$.
Because of the orthogonal transformation,


(5) $\sum_{m=1}^{M} \sum_{s=1}^{S(m)} \sum_{i=1}^{I} [z(i,m,s) - E\,z(i,m,s)]^2 = \sum_{m=1}^{M} \sum_{s=1}^{S(m)} \sum_{i=1}^{I} [y(i,m,s) - E\,y(i,m,s)]^2$.

We now break up the left side of (5) into two
parts:


(6) $\sum_{m=1}^{M} \sum_{s=1}^{S(m)} \sum_{i=1}^{I-1} [z(i,m,s) - E\,z(i,m,s)]^2 + \sum_{m=1}^{M} \sum_{s=1}^{S(m)} [z(I,m,s) - E\,z(I,m,s)]^2$.


Making use of the fact that


(7) $\sum_{m=1}^{M} \sum_{s=1}^{S(m)} [z(I,m,s) - E\,z(I,m,s)]^2 = \sum_{m=1}^{M} \sum_{s=1}^{S(m)} I\,[y(.,m,s) - \xi - \mu(m)]^2$,


we are finally able to discern the following
equality


(8) $\sum_{m=1}^{M} \sum_{s=1}^{S(m)} \sum_{i=1}^{I-1} [z(i,m,s) - E\,z(i,m,s)]^2 = \sum_{m=1}^{M} \sum_{s=1}^{S(m)} \sum_{i=1}^{I} [y(i,m,s) - y(.,m,s) - \zeta(i) - \theta(i,m)]^2$


which has (I-1)S degrees of freedom associated
with it. The total sum of squares (5) has thus
been partitioned into two parts which may be
called the ‘‘within methods’’ sums of squares,
(8), and the ‘‘between methods’’ sums of
squares, (7).
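
The properties claimed for the transformation can be checked numerically. The following is a minimal sketch, not part of the original derivation: it builds the coefficient rows of (2) and (3), evaluates their variances and covariances under the equal-variance, equal-covariance assumption, and confirms the split of one subject's sum of squares. The values of I, the variance and the intra-class correlation, and the response vector, are illustrative assumptions.

```python
import math


def helmert_rows(n_items):
    """Coefficient rows of transformations (2) and (3) for one subject."""
    rows = []
    for h in range(1, n_items):  # h = 1, ..., I-1
        norm = math.sqrt(h * (h + 1))
        # z(h) = [h*y(h+1) - sum_{i<=h} y(i)] / sqrt(h(h+1))
        rows.append([-1.0 / norm] * h + [h / norm] + [0.0] * (n_items - h - 1))
    # z(I) = sum_i y(i) / sqrt(I)
    rows.append([1.0 / math.sqrt(n_items)] * n_items)
    return rows


def covariance(a, b, sigma2, rho):
    """C[a'y, b'y] when V[y(i)] = sigma2 and C[y(i), y(i')] = rho * sigma2."""
    total = 0.0
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            total += a_i * b_j * (sigma2 if i == j else rho * sigma2)
    return total


def transform(y):
    """Apply the transformation to one subject's item responses."""
    return [sum(c * v for c, v in zip(row, y)) for row in helmert_rows(len(y))]


I_ITEMS, SIGMA2, RHO = 5, 2.0, 0.3  # illustrative values only
rows = helmert_rows(I_ITEMS)

# Expected: sigma2 * (1 - rho) for each of the first I-1 variables.
within_variances = [covariance(r, r, SIGMA2, RHO) for r in rows[:-1]]
# Expected: sigma2 * (1 + (I-1) * rho) for the last variable.
between_variance = covariance(rows[-1], rows[-1], SIGMA2, RHO)

y = [1, 0, 1, 1, 0]  # responses of one hypothetical subject
z = transform(y)
mean_y = sum(y) / len(y)
within_ss = sum(v * v for v in z[:-1])  # equals sum_i (y(i) - mean)^2
between_ss = z[-1] ** 2                 # equals I * mean^2
```

Because the first I-1 rows are contrasts that sum to zero, the common covariance drops out of their variances, which is why the ‘‘within’’ precision depends only on the quantity one minus the intra-class correlation.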

This design has features which are similar
to the ‘‘split-plot’’ design which is frequently
used in agricultural research (see refs. 1, 4).
For lack of a better name, the model being
analyzed here might be termed the M-sample intra-
and inter-method linear hypothesis. The hypotheses
of the intra- and inter-method partitions are
tested with different precisions. These precisions
are estimates of $\sigma^2[1 - \rho(I)]$ and
$\sigma^2[1 + (I-1)\rho(I)]$ respectively. The tests
made within one partition are stochastically
independent of the tests made within the other.

The rank, or degrees of freedom, of the
estimation space for the intra-method partition is
IM - M since there exist IM - M linearly independent
interaction parameters, which persist through
the reduction of the reparameterized estimation
space of $E[y(i,m,s) - y(.,m,s)] = \zeta(i) + \theta(i,m)$
by elementary transformations. The rank for
the error sums of squares for this partition is
given by (I-1)S - (IM - M) = (I-1)(S-M). The
hypotheses $\zeta(i) = 0$ and $\theta(i,m) = 0$ yield sums
of squares which can easily be shown to have
I-1 and (I-1)(M-1) degrees of freedom associated
with them.

The rank of the estimation space for the
inter-method partition is M and the rank for the
error sums of squares is S-M. The hypotheses
$\xi = \xi(0)$ and $\mu(m) = 0$ have sums of squares with
1 and M-1 degrees of freedom. As a result of
these observations and some further algebra,
we find the analysis of variance given by Table I.

With the aid of the statistics calculated for
the entries in the sums of squares column of
Table I, it is possible to test the various
hypotheses listed in the first column. These tests
include the desired item-method interaction test
which is obtained by using the mean squares of
SS[IM] and SS[IS] as an F ratio with numerator
degrees of freedom (I-1)(M-1) and denominator
degrees of freedom (I-1)(S-M), that is,
$F[(I-1)(M-1), (I-1)(S-M); 1-\alpha]$.
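
A minimal sketch of how the item-method interaction ratio might be computed from raw item scores follows. It assumes the usual sums of squares for items crossed with subjects nested in methods; the function name, the data layout (a list of methods, each a list of subjects, each a list of I item scores) and the small data sets are all hypothetical.

```python
def interaction_f_ratio(data):
    """Return (F, df_numerator, df_denominator) for the item-method
    interaction, using the within-subject (intra-method) partition.

    data[m][s][i] is the score of subject s under method m on item i.
    """
    n_methods = len(data)
    n_items = len(data[0][0])
    all_subjects = [subject for method in data for subject in method]
    n_subjects = len(all_subjects)

    # Item means over all subjects, and the grand mean.
    item_mean = [sum(s[i] for s in all_subjects) / n_subjects for i in range(n_items)]
    grand_mean = sum(item_mean) / n_items

    ss_interaction = 0.0  # item x method
    ss_error = 0.0        # item x subject within methods
    for method in data:
        n = len(method)
        cell_mean = [sum(s[i] for s in method) / n for i in range(n_items)]
        method_mean = sum(cell_mean) / n_items
        for i in range(n_items):
            dev = cell_mean[i] - method_mean - item_mean[i] + grand_mean
            ss_interaction += n * dev ** 2
        for subject in method:
            subject_mean = sum(subject) / n_items
            for i in range(n_items):
                resid = subject[i] - cell_mean[i] - subject_mean + method_mean
                ss_error += resid ** 2

    df_interaction = (n_items - 1) * (n_methods - 1)
    df_error = (n_items - 1) * (n_subjects - n_methods)
    f_ratio = (ss_interaction / df_interaction) / (ss_error / df_error)
    return f_ratio, df_interaction, df_error


# Item profiles parallel across the two methods: no interaction.
parallel = [[[1, 2, 3], [2, 3, 5]], [[2, 3, 4], [3, 4, 6]]]
# Item profiles reversed in the second method: strong interaction.
reversed_profile = [[[1, 2, 3], [2, 3, 5]], [[3, 2, 1], [5, 3, 2]]]

f_parallel, df1, df2 = interaction_f_ratio(parallel)
f_reversed, _, _ = interaction_f_ratio(reversed_profile)
```

The resulting ratio would then be referred to a variance ratio table with the two degrees of freedom returned; the sketch does not look up the critical value.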


Considerations of Examination Internal Con- 
sistency 





Now we shall be concerned with a measure
of the internal consistency or the intra-class
correlation, $\rho(I)$, of the responses to the I items.
Also we shall define and estimate a function of
$\rho(I)$ which is called the index of internal
consistency, $\rho(H)$.

Hoyt (ref. 2) was among the first to define
an index of internal consistency measure which
was derived by variance algebra. In terms of
the notation used here, the parametric value of
the index is


(10) $\rho(H) = 1 - V[z(i,1,s)] / V[z(I,1,s)]$.


In equation (10), V[z(i,1,s)] is the variance
associated with effects which are not identified
with the items of the examination or its
administration to the subjects, that is, ‘‘error’’
variance, and V[z(I,1,s)] is the variance
associated with the scores of subjects who have
taken the examination specified by the conditions
identified by the number 1. Thus, if V[z(i,1,s)]
nearly equals V[z(I,1,s)], then the index of
internal consistency is small. This means that
the precision associated with individual scores,
composed of responses to I items, is large
relative to the ‘‘error’’ variance. In other words,
the examination does not discriminate between
individuals very well. As a consequence,
confidence intervals for individual scores would be
relatively large.

It is important to realize the distinction
between $\rho(I)$ and $\rho(H)$. The parameter, $\rho(I)$, is a
covariance property of the responses to the
items of the examination. The parameter, $\rho(H)$,
is an index of a relative precision property of a
score composed by a simple linear function of
the responses to the examination items within a
single administration of the examination to each
subject. If the examination is administered
several times to the same subjects, a derivation
similar to the one shown earlier in this paper
leads to the evaluation of another covariance
property of the scores which is known as the
coefficient of external consistency. This value is
often termed the ‘‘reliability’’ of the
examination.

The expressions V[z(i,1,s)] and V[z(I,1,s)]
are merely special cases of $\sigma^2[IS]$ and $\sigma^2[S]$
of Table I where we consider only a single method
(M=1) or a particular set of conditions,
experimental or otherwise. Of course, in this
case, H[IM] and H[M] are not tested since no
degrees of freedom exist for the tests. Thus
the procedure given by Hoyt (ref. 2) is a special
case of this analysis.

From Table I,


(11) $\rho(H) = 1 - \frac{V[z(i,1,s)]}{V[z(I,1,s)]} = 1 - \frac{\sigma^2[1 - \rho(I)]}{\sigma^2[1 + (I-1)\rho(I)]} = \frac{I\rho(I)}{1 + (I-1)\rho(I)}$.


It is easy to see that the index of internal
consistency, $\rho(H)$, is a monotonically increasing
function of $\rho(I)$. Therefore, for a fixed I, as
$\rho(I)$ increases, so does $\rho(H)$, and conversely. If
we express $\sigma^2[S]$ as a function of $\rho(H)$ rather
than $\rho(I)$ we have


(12) $\sigma^2[S] = I\sigma^2 / [I - (I-1)\rho(H)]$.


From equation (11), which has the same form
as the famous Spearman-Brown prophecy formula,
we learn that the index of internal consistency
increases as the examination length increases.
An increase of the number of covariately
consistent items would no doubt increase
the variability between subjects. This is shown
by $\sigma^2[S] = \sigma^2[1 + (I-1)\rho(I)]$. However, the
error variance, $\sigma^2[IS] = \sigma^2[1 - \rho(I)]$, is
invariant to the number of such items. Also,
$\sigma^2[IS] = \sigma^2[S]\{1 - \rho(H)\}$, which can be shown
by simple algebra. Thus we cannot increase the
absolute precision associated with the errors of
measurement by increasing the examination length,
but the relative precision is increased. The
parameters $\sigma^2[S]$ and $\sigma^2[IS]$, or their estimates,
are often expressed in score units rather than
in item units. This is accomplished by
multiplying those values by the number of
items, I.
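
The relations among the intra-class correlation, the index of internal consistency and the two variances can be verified with a few lines of arithmetic. The sketch below uses illustrative parameter values and hypothetical function names; it simply evaluates the formulas of this section.

```python
def index_of_internal_consistency(rho_i, n_items):
    """Equation (11): rho(H) = I*rho(I) / (1 + (I-1)*rho(I))."""
    return n_items * rho_i / (1.0 + (n_items - 1) * rho_i)


def subject_variance(sigma2, rho_i, n_items):
    """sigma^2[S] = sigma^2 * (1 + (I-1)*rho(I))."""
    return sigma2 * (1.0 + (n_items - 1) * rho_i)


def error_variance(sigma2, rho_i):
    """sigma^2[IS] = sigma^2 * (1 - rho(I)); does not depend on I."""
    return sigma2 * (1.0 - rho_i)


def subject_variance_from_index(sigma2, rho_h, n_items):
    """Equation (12): sigma^2[S] = I*sigma^2 / (I - (I-1)*rho(H))."""
    return n_items * sigma2 / (n_items - (n_items - 1) * rho_h)


SIGMA2, RHO_I = 1.0, 0.2  # illustrative values only

for n_items in (5, 10, 20, 40):
    rho_h = index_of_internal_consistency(RHO_I, n_items)
    var_s = subject_variance(SIGMA2, RHO_I, n_items)
    var_is = error_variance(SIGMA2, RHO_I)
    # The index rises with test length while the error variance stays fixed.
    print(n_items, round(rho_h, 4), round(var_s, 4), round(var_is, 4))
```

Lengthening the test raises the subject variance and hence the index, while the error variance is unchanged, which is the distinction between absolute and relative precision drawn above.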

A test of the hypothesis that $\rho(I) = 0$, and
consequently that $\rho(H) = 0$, is provided by the
ratio (I-1) SS[S] / SS[IS], which is distributed
as $F[S-M, (I-1)(S-M); 1-\alpha]$. The upper tail
of the F distribution should contain the critical
region since if $\rho(I) > 0$, the expected value of
the mean square of SS[S] is greater than the
expected value of the mean square of SS[IS]. The
alternative hypothesis $\rho(I) < 0$ is not of much
practical importance because $\sigma^2[S] \geq 0$ and
consequently $\rho(I) \geq -1/(I-1)$. The $100(1-\alpha)\%$
confidence interval for $\rho(I)$ may be obtained from
the expression given in (13) wherein F[o] is
the ratio of the observed mean squares,
MS[S]/MS[IS]; $F[U] = F[S-M, (I-1)(S-M); 1-\alpha/2]$
and $F[L] = 1/F[(I-1)(S-M), S-M; 1-\alpha/2]$
are values obtained from a variance ratio table.


(13) $\frac{F[o] - F[L]}{(I-1)F[L] + F[o]} > \rho(I) > \frac{F[o] - F[U]}{(I-1)F[U] + F[o]}$.


A point estimate for $\rho(I)$ can be obtained from
(13) by using either the left or right hand side
of the inequality and setting F[L] or F[U] equal
to 1. The confidence interval for $\rho(H)$ is shown
in (14). The point estimate of $\rho(H)$ is found by
setting F[U] or F[L] equal to 1.


(14) $1 - F[L]/F[o] > \rho(H) > 1 - F[U]/F[o]$.
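
As a rough sketch of how these limits might be evaluated, the functions below take the observed ratio F[o] and the two tabled values F[L] and F[U] as inputs. The tabled values are not computed here, and the numbers in the example are made up for illustration rather than read from any table.

```python
def rho_i_interval(f_obs, f_lower, f_upper, n_items):
    """Lower limit, point estimate and upper limit for rho(I).

    f_lower and f_upper are the tabled values F[L] and F[U];
    setting either to 1 gives the point estimate.
    """
    def bound(f_ref):
        return (f_obs - f_ref) / ((n_items - 1) * f_ref + f_obs)

    return bound(f_upper), bound(1.0), bound(f_lower)


def rho_h_interval(f_obs, f_lower, f_upper):
    """Lower limit, point estimate and upper limit for rho(H)."""
    return 1.0 - f_upper / f_obs, 1.0 - 1.0 / f_obs, 1.0 - f_lower / f_obs


# Hypothetical inputs: observed MS[S]/MS[IS] = 4.0 on a 10-item test,
# with made-up tabled values F[L] = 0.5 and F[U] = 2.0.
low_i, point_i, high_i = rho_i_interval(4.0, 0.5, 2.0, n_items=10)
low_h, point_h, high_h = rho_h_interval(4.0, 0.5, 2.0)
```

Since F[U] exceeds F[L], the limit built from F[U] is the smaller one in both intervals, and each limit for the index is the Spearman-Brown transform of the corresponding limit for the intra-class correlation.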


It is also possible, and often desirable, to
partition each of the sums of squares and
degrees of freedom of Table I into M parts. From
these, tests and estimates of $\rho[I(m)]$ and
$\rho[H(m)]$ can be made. If we call r[H(m)] the
sample estimate of the index of internal consistency
for the mth method, then $\rho(H)$, the total index, is
estimated by the weighted average of the r[H(m)].
The weights are the sums of squares for the errors,
E[IS(m)].

Before the tests are carried out by means of
Table I, it is advisable to check the hypotheses
$\sigma^2[IS(m)] = \sigma^2[IS(m')]$ and $\sigma^2[S(m)] = \sigma^2[S(m')]$
using homogeneity of variance tests.


Summary 
Starting from the definition of the linear
model of an observation obtained as the result of
the administration of an examination in a simple
experimental situation, an analysis of variance
was obtained whereby tests of hypotheses
associated with methods, items, item-method
interactions and the general mean were made.
Furthermore, point estimates and confidence
intervals were shown for statistics which estimate
the:


a. internal consistency: the intra-class
correlation of the responses to the
items of the examination,


b. index of internal consistency: a measure
of the relative precision of a score
which is formed by a linear function of
the responses to the items of the
examination.


The evaluation of the internal consistency of 
an examination is important for comparison and 
item construction purposes. The index of intern- 
al consistency provides a convenient measure of 
the relative precision of the error and subject 
variances of the examination. The derivation 
given in this paper is easily extended to other 
common experimental designs.

SIS WITH OBJECTIVES OF MINIMIZING

Principles for Classification





THE PURPOSE of this study is to
describe and illustrate certain developments in
the field of classificatory analysis which are
potentially useful in education. The problem
under consideration arises when a number of
measurements are made on an individual and it is
desired to classify the individual into one of
several categories on the basis of these
measurements. It is assumed that the individual has
been randomly drawn from one of the populations
and that in each population there is a probability
distribution of the measurements.

In problems of this type it frequently may 
be observed that the consequences of wrong 
decisions are not equally undesirable. Somewhat 
unique is the assumption that it is possible to 
specify the loss of utility resulting from a wrong 
decision or misclassification. Loss may be 
thought of as the penalty paid by the statistician 
when his guess is incorrect. In this investiga- 
tion it serves to differentiate the seriousness of 
the possible errors of classification. 

By the risk of committing a certain error 
in classification is meant the probability of that 
error multiplied by the corresponding loss. Risk 
thus corresponds to the expected value of the 
loss or the expected disutility in the long-run 
use of some classification rule. In considering 
the problem of classifying an individual into one 
of several populations there is a risk or expect- 
ed loss associated with each possible choice 
or classification. 

Important in the solution of problems of
classification are the a priori probabilities whose
values postulate the respective chances
of drawing an individual from each of the
populations under consideration. The criterion by
which classification rules are chosen varies
with the presence or absence of the a priori
probabilities. If they are assumed to be known,
rules are selected which minimize overall risk
of misclassification. If in a particular problem
it is not possible or appropriate to estimate a
priori probabilities, the solution requires a
different criterion of goodness. Under such
conditions it is possible to choose classification
rules which minimize the maximum risk of
misclassification. This is a conservative approach
and represents an application of the minimax
principle.

Anderson (1) discusses these criteria in detail.
Rao (6), Brown (2), and Welch (9) are useful
references. More formal reasoning is generally
expressed as follows: Assume, for example,
that an individual I with the set of measurements
$x_1, \ldots, x_q$ has been randomly drawn
from one of two populations $\Pi_1$ or $\Pi_2$ with
probability distributions of the measurements
$f_1(x_1, \ldots, x_q)$ and $f_2(x_1, \ldots, x_q)$,
respectively. The choice of a classification rule R
corresponds to a division of the q-dimensional
sample space into two regions $R_1$ and $R_2$. If the
random point corresponding to the individual falls
in $R_1$ the decision is made that I is from $\Pi_1$,
and if the point falls in $R_2$ the decision is made
that I is from $\Pi_2$.

Let L(2/1) be the loss incurred in classifying
I into $\Pi_2$ when in fact I belongs to $\Pi_1$, and
let L(1/2) be the loss incurred in classifying I
into $\Pi_1$ when in fact I belongs to $\Pi_2$. For any
rule R let $r_1(R)$ be the conditional risk or
expected loss when I belongs to $\Pi_1$. It is defined
as the probability of I falling in $R_2$ multiplied
by the loss associated with this mistake.
Similarly, let $r_2(R)$ be the conditional risk when I
belongs to $\Pi_2$. It is defined as the probability
of falling in $R_1$ multiplied by the corresponding
loss.

Suppose there exist a priori probabilities that
I comes from the two populations, say probability
$p_1$ that I does in fact belong to $\Pi_1$, and
probability $p_2$ that I belongs to $\Pi_2$, where
$p_1 + p_2 = 1$. It is then possible to define an
unconditional risk for any classification rule R as
equal to $p_1 r_1(R) + p_2 r_2(R)$. The objective is
to find the regions $R_1$ and $R_2$ which minimize
this risk.


*Summary of unpublished doctoral dissertation, University of Minnesota, 1954. Advisor: Palmer O. Johnson,
Professor of Education, University of Minnesota.

In other problems nothing may be known
about the probabilities $p_1$ and $p_2$ of drawing from
the two populations. Here the unconditional risk
cannot be defined and it is necessary to consider
the risk $r_1(R)$ when I is from $\Pi_1$, and $r_2(R)$
when I is from $\Pi_2$. It can be shown that the
minimax requirement will be satisfied when
$r_1(R) = r_2(R)$.
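
The two criteria can be compared in the simplest setting. The sketch below is an illustration under stated assumptions, not the procedure of the study: a single measurement, two normal populations with a common standard deviation, the first population having the larger mean, and a rule that assigns the individual to the first population when the measurement is at or above a cutoff. All names and numerical values are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def conditional_risks(cutoff, mu1, mu2, sigma, loss_2_given_1, loss_1_given_2):
    """Return (r1, r2) for the rule 'classify into population 1 if x >= cutoff'."""
    r1 = loss_2_given_1 * normal_cdf((cutoff - mu1) / sigma)           # a member of 1 falls in R2
    r2 = loss_1_given_2 * (1.0 - normal_cdf((cutoff - mu2) / sigma))   # a member of 2 falls in R1
    return r1, r2


def overall_risk(cutoff, mu1, mu2, sigma, loss_2_given_1, loss_1_given_2, p1, p2):
    """Unconditional risk p1*r1 + p2*r2."""
    r1, r2 = conditional_risks(cutoff, mu1, mu2, sigma, loss_2_given_1, loss_1_given_2)
    return p1 * r1 + p2 * r2


def minimum_risk_cutoff(mu1, mu2, sigma, loss_2_given_1, loss_1_given_2, p1, p2):
    """Cutoff at which the density ratio f1/f2 equals k = L(1/2)p2 / (L(2/1)p1)."""
    k = (loss_1_given_2 * p2) / (loss_2_given_1 * p1)
    return (mu1 + mu2) / 2.0 + sigma ** 2 * math.log(k) / (mu1 - mu2)


def minimax_cutoff(mu1, mu2, sigma, loss_2_given_1, loss_1_given_2, tol=1e-10):
    """Cutoff equalizing the two conditional risks, found by bisection."""
    low, high = mu2 - 10.0 * sigma, mu1 + 10.0 * sigma
    while high - low > tol:
        mid = (low + high) / 2.0
        r1, r2 = conditional_risks(mid, mu1, mu2, sigma, loss_2_given_1, loss_1_given_2)
        if r1 < r2:      # r1 rises and r2 falls as the cutoff moves up
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical example: unequal losses and nearly equal a priori probabilities.
c_risk = minimum_risk_cutoff(1.0, 0.0, 1.0, 3.0, 2.0, 0.51, 0.49)
c_minimax = minimax_cutoff(1.0, 0.0, 1.0, 3.0, 2.0)
```

The minimum risk cutoff uses the a priori probabilities, whereas the minimax cutoff ignores them and only balances the two conditional risks.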


Multivariate Normal Solutions 





Of special interest in application are solutions
for the multivariate normal case. Anderson
(1) treats these solutions and his results are
presented here without accompanying mathematical
arguments. For two populations, let $x_1, \ldots, x_q$
have multivariate normal distributions with means
in $\Pi_1 = \mu_i^{(1)}$ and in $\Pi_2 = \mu_i^{(2)}$, $i = 1, \ldots, q$.
Assume the two distributions have common variances
and correlations. The regions for minimum risk are
defined by the following discriminant functions
or classification rules:


$R_1$: $\sum_{i=1}^{q} \lambda_i x_i - \tfrac{1}{2} \sum_{i=1}^{q} \lambda_i (\mu_i^{(1)} + \mu_i^{(2)}) \geq \log_e k$

$R_2$: $\sum_{i=1}^{q} \lambda_i x_i - \tfrac{1}{2} \sum_{i=1}^{q} \lambda_i (\mu_i^{(1)} + \mu_i^{(2)}) < \log_e k$


where $\lambda_1, \ldots, \lambda_q$ is the solution of

$\sum_{j=1}^{q} \sigma_{ij} \lambda_j = \mu_i^{(1)} - \mu_i^{(2)}$, $i = 1, \ldots, q$,

and

$k = \frac{L(1/2)\,p_2}{L(2/1)\,p_1}$.


If L(1/2) = L(2/1), $k = p_2/p_1$, which is the
solution of minimum probability of
misclassification. If a priori probabilities are not
specified, $\log_e k$ (= c) may be found so that the
risk when I is from $\Pi_1$ is equal to the risk when
I is from $\Pi_2$. The solution employs the univariate
normal table.

Similar formal solutions are obtained for m 
multivariate normal populations with the same 
set of variances and correlations. In practice
these solutions present great difficulties when
the number of populations is greater than two.
For example, when three groups are considered
it is necessary to assume equal costs of
misclassification.


Assumptions 
Of basic importance is the assumption that it
is possible to assign a comparable numerical
value to each of the losses incurred in
misclassification. While presently this assumption
is difficult if not impossible to satisfy with
rigor, it should be noted that the usual assumption
of equal costs of misclassification is often
unrealistic. Utility measurement is an important
contemporary problem in the economic field. The
reader is referred to Friedman and Savage (4) for
one account of the historical background of utility
discussions. An undecided issue appears to be
whether utility (or disutility) is actually
measurable or whether any choice among alternatives
depends merely on their ordering. Coombs (3) in the
social sciences suggested the use of a scale not
involving a unit of measurement designed to order
the magnitude of the interval between objects. The
scale was in addition to the types set up by
Stevens (7) and fell logically between the interval
scale and the ordinal scale.

It is assumed in the model that the parameter 
values in the multivariate normal distributions 
are known exactly. This is seldom true in practical
applications. Although Wald (8) and Anderson (1)
have developed classification statistics
which employ sample estimates for parameter 
values in problems involving two multivariate 
normal populations with the same covariance 
matrix, the distributions are presently unsuitable 
for practical use. A reasonable solution consists 
in obtaining the best possible estimates and sub- 
stituting them for the unknown parameters in ob- 
taining the classification rules. For sufficiently 
large samples it is possible to proceed as if the 
parameters were known, though some unknown error
is thereby introduced.


Illustrative Problem 





The General College of the University of Min- 
nesota maintains a staff of trained counselors who 
devote their time primarily to helping individual 
students. A principal function of these counselors 
is to assist first quarter students in the selection 
and planning of an appropriate educational goal. 

In this selection process it is important to con- 
sider the capabilities of the new student with re- 
spect to other groups of General College students. 
Do his capabilities most nearly resemble those of 
students who drop out of the General College, stu- 
dents who are accepted for transfer by another 
college of the University, or students who continue 
in the General College to receive an associate in 
arts degree? These are the groups employed in 
classification. 

Recommendations are customarily made on a
judgmental basis using information available to the
counselor and the beginning student. Objective
measures frequently used for this purpose include
high school percentile rank, score on the American
Council on Education Psychological Examination,
score on the Cooperative English Test, and
scores on the Comprehensive Examination of the
General College. These are the variables
employed for establishing the classification rules.

It was considered desirable in the design of 
this investigation to provide for comparable 
groups on which to establish classification rules 
and cross validate the results. This was done by 
selecting a random half of all students matricu- 
lating in the General College during the fall 
quarters of 1949 and 1950. Within each half 
students defined by population restrictions were 
assumed to be samples of their respective 
populations. Sample sizes in the main analysis 
were 273 for the drop out group, 76 for the grad- 
uate group, and 79 for the transfer group. Sam- 
ple sizes used in cross validation were 264 for 
the drop out group, 75 for the graduate group, 
and 88 for the transfer group. 


Classification Into One of Two Groups 





Applications were made to the problem of 
classifying students into one of two categories: 
(a) students who were accepted for transfer by 
another college of the University and did not 
graduate from the General College and (b) stu- 
dents who graduated from the General College. 
Similar classification rules were established us- 
ing two different sets of variables. The duplica- 
tion was made to examine the models under dif- 
ferent conditions of application. For each set of 
variables three solutions were obtained corres- 
ponding to the following standards of goodness: 
minimizing risk of misclassification, minimizing 
maximum risk of misclassification, and minimiz- 
ing probability of misclassification. Mean values 
for the two groups on four selected variables were 
as follows: 


           Transfer    Graduate
ACE          94.16       87.83
Eng.        167.38      162.62
HSR*          4.51        4.41
Comp.        60.28       53.43
*probit values


The dispersion matrix based on the two groups is 
shown below: 


            ACE       Eng.      HSR      Comp.
ACE       248.49    251.30     1.06    110.70
Eng.      251.30    930.73     1.90    167.45
HSR         1.06      1.90      .39      1.46
Comp.     110.70    167.45     1.46    216.47

The regions were defined by the following rules 
or discriminant functions: 


$R_1$: $.0185X_1 - .0046X_2 + .1494X_3 + .0247X_4 - 2.995 \geq \log_e k$





JOHNSON 


$R_2$: $.0185X_1 - .0046X_2 + .1494X_3 + .0247X_4 - 2.995 < \log_e k$


where values of k were chosen according to the 
described criteria of goodness. 
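
The printed coefficients can be checked against the printed means and dispersion matrix. The sketch below solves the linear system for the coefficients by Gaussian elimination and forms the constant term. It assumes the variables are ordered ACE, Eng., HSR, Comp. and that the transfer group plays the role of the first population; the function names are hypothetical. Because the published inputs are rounded, the solved values agree with the printed ones only approximately, and the HSR coefficient is the most sensitive to that rounding.

```python
MEANS_TRANSFER = [94.16, 167.38, 4.51, 60.28]   # ACE, Eng., HSR, Comp.
MEANS_GRADUATE = [87.83, 162.62, 4.41, 53.43]
DISPERSION = [
    [248.49, 251.30, 1.06, 110.70],
    [251.30, 930.73, 1.90, 167.45],
    [1.06, 1.90, 0.39, 1.46],
    [110.70, 167.45, 1.46, 216.47],
]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def discriminant(dispersion, mean_1, mean_2):
    """Return (coefficients, constant) of the linear discriminant function."""
    difference = [a - b for a, b in zip(mean_1, mean_2)]
    coefficients = solve(dispersion, difference)
    constant = 0.5 * sum(c * (a + b) for c, a, b in zip(coefficients, mean_1, mean_2))
    return coefficients, constant


def score(x, coefficients, constant):
    """Value of the discriminant function, to be compared with log_e k."""
    return sum(c * v for c, v in zip(coefficients, x)) - constant


coefficients, constant = discriminant(DISPERSION, MEANS_TRANSFER, MEANS_GRADUATE)
print([round(c, 4) for c in coefficients], round(constant, 3))
```

A student's score from `score` is then compared with the logarithm of k chosen under whichever criterion of goodness is in use.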


Approximations of losses due to misclassifi- 
cation were obtained from three experienced 
counselors in the General College. Since the 
counselors were not in agreement, a median 
judgment was employed in the analysis where 
disutility in misclassifying transfers = 3 and 
disutility in misclassifying graduates = 2. 
Values assumed for the a priori probabilities 
were based on the relative frequencies of the two 
groups. These values were .51 and .49 for
transfers and graduates, respectively. Subsequently,
the solution of minimum risk correctly classified 
a large majority of students actually belonging to 
the transfer group, whereas the same solution 
misclassified an equally large majority of stu- 
dents belonging to the graduate group. The tend- 
ency to classify most people as transfers was pro- 
nounced since the difference between groups was 
not very large while the difference in disutilities 
was comparatively great. A similar pattern 
should result if a counselor were to employ the 
same variables under a criterion of minimum 
risk. 
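
The pull toward the transfer category can be seen by evaluating the threshold. The sketch below is an illustration under stated assumptions: it takes the transfer group as the first population, so that the loss of misclassifying a transfer is 3 and that of misclassifying a graduate is 2, and it applies the printed discriminant function to the two printed group mean vectors. The function names are hypothetical.

```python
import math

COEFFICIENTS = [0.0185, -0.0046, 0.1494, 0.0247]  # printed rule: ACE, Eng., HSR, Comp.
CONSTANT = 2.995

MEANS_TRANSFER = [94.16, 167.38, 4.51, 60.28]
MEANS_GRADUATE = [87.83, 162.62, 4.41, 53.43]


def log_k(loss_1_given_2, loss_2_given_1, p1, p2):
    """log_e of k = L(1/2) p2 / (L(2/1) p1)."""
    return math.log((loss_1_given_2 * p2) / (loss_2_given_1 * p1))


def classify(x, threshold):
    """Assign to the transfer group when the discriminant score reaches the threshold."""
    value = sum(c * v for c, v in zip(COEFFICIENTS, x)) - CONSTANT
    return "transfer" if value >= threshold else "graduate"


# Minimum risk: disutilities 3 (transfers) and 2 (graduates), priors .51 and .49.
threshold_risk = log_k(loss_1_given_2=2.0, loss_2_given_1=3.0, p1=0.51, p2=0.49)
# Minimum probability of misclassification: equal losses.
threshold_prob = log_k(loss_1_given_2=1.0, loss_2_given_1=1.0, p1=0.51, p2=0.49)

print(round(threshold_risk, 3), round(threshold_prob, 3))
print(classify(MEANS_GRADUATE, threshold_risk), classify(MEANS_GRADUATE, threshold_prob))
```

Under these assumptions the minimum risk threshold is low enough that even a student sitting exactly at the graduate group means is assigned to the transfer group, whereas the equal-loss threshold assigns that student to the graduate group.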

The solution in which maximum risk of mis- 
classification was minimized produced classifi- 
cation frequencies which were intermediate to 
those obtained by the other solutions. This rule 
was not as efficient in reducing overall risk since 
it was based on a lesser amount of information, 
namely, the a priori probabilities of drawing 
from the two populations. In terms of overall 
number of correct classifications the ‘‘best’’ so- 
lution was that in which probability of misclass- 
ification was minimized. This solution assumed 
equal costs of misclassification and, consequent- 
ly, represented a different criterion of goodness. 
Tables I, II and III present the results of the 
cross validation studies for the various classifi- 
cation rules. 
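
The empirical probabilities of misclassification in the cross validation tables are row proportions and can be recomputed from the cell counts. A minimal sketch follows, with hypothetical names; the counts are those printed for the three two-group solutions.

```python
def misclassification_rates(table):
    """table[true_category][predicted_category] holds cross validation counts.
    Returns the proportion of each true category assigned elsewhere."""
    rates = {}
    for true_category, row in table.items():
        total = sum(row.values())
        rates[true_category] = (total - row[true_category]) / total
    return rates


def pooled_rate(table):
    """Proportion of all cases misclassified, ignoring a priori weights."""
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(row[category] for category, row in table.items())
    return (total - correct) / total


MINIMUM_RISK = {
    "transfer": {"transfer": 73, "graduate": 15},
    "graduate": {"transfer": 63, "graduate": 12},
}
MINIMAX = {
    "transfer": {"transfer": 54, "graduate": 34},
    "graduate": {"transfer": 46, "graduate": 29},
}
MINIMUM_PROBABILITY = {
    "transfer": {"transfer": 50, "graduate": 38},
    "graduate": {"transfer": 37, "graduate": 38},
}

for name, table in (("minimum risk", MINIMUM_RISK),
                    ("minimax", MINIMAX),
                    ("minimum probability", MINIMUM_PROBABILITY)):
    rates = misclassification_rates(table)
    print(name, {k: round(v, 2) for k, v in rates.items()}, round(pooled_rate(table), 2))
```

The minimax counts give rates that lie between those of the other two solutions for both groups, in line with the description above.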


Classification Into One of Three Groups 





In the three group solutions it was necessary 
for purposes of technical simplification to as- 
sume equal costs of misclassification. This es- 
sentially meant that classification rules were 
chosen on the basis of probabilities of misclass- 
ification rather than risks of misclassification. 
Another distinguishing feature of the three group 
solution was the use of the bivariate normal table 
(5) to obtain the minimax solution. 

Students were assigned to one of three groups: 
(a) transfer, (b) graduate, and (c) drop out. Sep- 
arate rules were established with objectives of 
minimizing probability of misclassification and 

TABLE I 

CLASSIFICATION BY SOLUTION OF MINIMUM RISK 

                                        Probabilities of 
True         Predicted Category         Misclassification 
Category   Transfer  Graduate  Total   Empirical  Predicted 
Transfer       73        15      88       .17        .14 
Graduate       63        12      75       .84    [illegible] 
Total         136        27     163       .50        .42 

TABLE II 

CLASSIFICATION BY MINIMIZING THE MAXIMUM 
RISK OF MISCLASSIFICATION 

                                        Probabilities of 
True         Predicted Category         Misclassification 
Category   Transfer  Graduate  Total   Empirical  Predicted 
Transfer       54        34      88       .39        .32 
Graduate       46        29      75       .61        .48 
Total         100        63     163 

TABLE III 

CLASSIFICATION BY SOLUTION OF MINIMUM PROBABILITY 
OF MISCLASSIFICATION 

                                        Probabilities of 
True         Predicted Category         Misclassification 
Category   Transfer  Graduate  Total   Empirical  Predicted 
Transfer       50        38      88       .43        .37 
Graduate       37        38      75       .49        .42 
Total          87        76     163       .46        .39 


TABLE IV 

CLASSIFICATION BY SOLUTION OF MINIMUM PROBABILITY 
OF MISCLASSIFICATION 

                                                  Probabilities of 
True            Predicted Category                Misclassification 
Category   Transfer  Graduate  Drop Out  Total   Empirical  Predicted 
Transfer       15         0       30       45       .66        .79 
Graduate        3         0       31       34      1.00        .99 
Drop Out        4         0      117      121       .03        .04 
Total          22         0      178      200       .32        .35 





TABLE V 

CLASSIFICATION BY MINIMIZING THE MAXIMUM PROBABILITY 
OF MISCLASSIFICATION 

                                                  Probabilities of 
True            Predicted Category                Misclassification 
Category   Transfer  Graduate  Drop Out  Total   Empirical  Predicted 
Transfer       20        19        6       45       .56        .55 
Graduate        7        21        6       34       .38        .55 
Drop Out       14        50       57      121       .53        .55 
Total          41        90       69      200 

minimizing maximum probability of misclassification. 
Customarily, only one standard of goodness would be 
adopted depending upon the presence or absence of 
a priori probabilities of drawing from the three 
populations. In this example probability of drawing 
from the transfer group = .18, from the graduate 
group = .18, and from the drop out group = .64. 

In the solution of minimum probability of 
misclassification most students in the cross 
validation groups were assigned to the drop out 
category. This was because differences between 
groups were not very great, while differences 
between a priori probabilities were comparatively 
large. When probabilities were not assumed a 
solution was obtained which minimized the maximum 
probability of misclassification. Although 
not as efficient in terms of overall number of 
correct classifications, the minimax solution 
classified students in a more normal fashion. Tables 
IV and V summarize the cross validation results 
for the three group solutions based on a sample 
of 200 students drawn at random from all 
validation groups. 
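
The empirical probabilities in Tables IV and V can be recomputed from the classification counts. The sketch below does so and also forms an overall figure by weighting each group's error rate by its a priori probability; that weighting is an assumption (it is not stated in the text), adopted here because it comes close to the total shown in Table IV. Function names are hypothetical.

```python
# Counts from Tables IV and V: rows are true categories, columns are
# predicted categories, both in the order transfer, graduate, drop out.
table_iv = [[15, 0, 30], [3, 0, 31], [4, 0, 117]]
table_v = [[20, 19, 6], [7, 21, 6], [14, 50, 57]]
priors = [0.18, 0.18, 0.64]  # a priori probabilities given in the text


def error_rates(confusion):
    """Empirical probability of misclassification for each true category."""
    return [1 - row[i] / sum(row) for i, row in enumerate(confusion)]


def overall_error(confusion, weights):
    """Overall error weighted by a priori probabilities (assumed weighting)."""
    return sum(w * e for w, e in zip(weights, error_rates(confusion)))


for name, table in (("IV", table_iv), ("V", table_v)):
    rates = ", ".join(f"{e:.2f}" for e in error_rates(table))
    print(f"Table {name}: {rates}; overall {overall_error(table, priors):.2f}")
```

The contrast between the two rules is visible in the per-group rates: the first rule errs almost never on drop outs and always on graduates, while the minimax rule spreads its errors far more evenly.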


Summary 


This investigation had its setting in the Gen- 
eral College of the University of Minnesota and 
illustrated the classification of students into ed- 
ucational groups on the basis of measures com- 
monly used for counseling purposes. The groups 
were assumed to represent multivariate normal 
populations with the same set of variances and 
correlations. Comparable samples were obtain- 
ed on which to establish classification rules and 
cross validate results. This was done by select- 
ing a random half of all students matriculating in 
the General College during the fall quarters of 
1949 and 1950. 

Principles for choosing a rule of classifica- 
tion were based on an approximation of the losses 
incurred with misclassification. By the risk of 
committing an error in classification was meant 
the probability of that error multiplied by the 
corresponding loss. The criterion by which a 
rule was chosen varied with the presence or ab- 
sence of a priori probabilities of drawing an in- 
dividual from the populations. If probabilities 
were assumed to be known, a rule was selected 
which minimized risk of misclassification. 
Assuming it was not possible or appropriate to 
assign a priori probabilities, a rule was chosen 
which minimized the maximum risk of misclassi- 
fication. When three groups were involved it was 
necessary to assume equal costs of misclassifi- 
cation. 

The techniques are applicable in theory to 
most classification problems. They should be 
useful, for example, in making decisions on 
whether to admit or refuse to admit students to 
college. While a rigorous determination of utility 
values remains a serious obstacle, an approxi- 
mated value may sometimes be more realistic 
than the assumption of no differences. 


THE POISSON DISTRIBUTION IN 
EDUCATIONAL RESEARCH 


SAM DUKER 
Brooklyn College 


EDUCATIONAL research in many instances 
consists of gathering data about a sample 
and then generalizing the resulting findings 
to the population from which the sample was se- 
lected. The inferences thus made are valid 
only if, first, the sample is properly selected, 
and second, appropriate statistical techniques 
are employed. The interesting and often quite 
involved questions raised by the process of 
sampling are not within the scope of this paper. 
This paper deals only with the phase of the 
statistical process by means of which inferences 
are made about a defined population on the basis 
of data which are collected from a sample of 
that population. 

Fundamentally, such inferences can only be 
drawn when the shape of the distribution formed 
by the variables making up the population can be 
justifiably hypothesized. In educational research 
there has been a noticeable tendency to assume 
that all distributions are of the normal or 
bell-shaped curve type. This assumption is not 
always justified. The normal distribution is a very 
convenient statistical model; it is neatly 
symmetrical and mean, mode and median coincide. 
In real situations we seldom find such neat 
symmetry nor do we find many cases where the 
mean, mode and median coincide. These 
statements in no way depreciate the importance of 
the concept of the normal distribution for this 
concept is an extremely important and useful 
one in educational research. 

When the shape of the distribution followed by 
the population is not known it becomes necessary 
to use non-parametric statistical devices. Such 
techniques are applicable regardless of the shape 
of the distribution and are therefore highly use- 
ful. More accurate and usable results are possi- 
ble, however, when the shape of the distribution 
can be validly hypothesized. 

Many distributions other than the normal are 
known to statisticians. One of these is the 
Poisson distribution. Curiously enough, the 
existence of this distribution was first noted in 
1740 by Abraham de Moivre in his Doctrine of 
Chances (1) and later by the man after 
whom it is named. Simeon Denis Poisson, a 
French mathematician who lived from 1781 to 
1840 and who was a man of rare diversity of 
talents, developed a derivation for this 
distribution in 1820 (2) in a work on the application 
of the laws of probability to the determination 
of the likelihood of miscarriages of justice in 
courts of law. Since that time this distribution 
has been independently derived by a number of 
other investigators. The Poisson distribution 
may be derived either as a limit of the binomial 
distribution or as an independent distribution of 
randomly distributed rare events. 

The Poisson distribution is a discrete function 
which exists only for non-negative integral 
values of the variate (zero and the positive 
integers). When the mean is less than one the 
probability polygon is J-shaped. When the mean 
is greater than one it becomes double sided and 
for larger values of the mean tends toward 
symmetry. This is another way of saying that 
when the mean is less than one, the mode is zero. 
When the mean is greater than one, but not an 
integer, the mode is the nearest integer less 
than the mean. When the mean is an integer 
greater than or equal to one the probability 
polygon is bimodal, the modes being the mean 
and the integer immediately below the mean. 
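
These statements about the mode can be checked numerically. The following is a minimal sketch, assuming the usual Poisson formula P(X = k) = e^(-m) m^k / k!; the function names are hypothetical and serve only as illustration.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variate with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def poisson_modes(mean):
    """Modes as described in the text.

    The mode is the largest integer not exceeding the mean; when the mean
    is itself a positive integer, the integer just below it is also a mode.
    """
    top = math.floor(mean)
    if mean >= 1 and float(mean).is_integer():
        return [top - 1, top]
    return [top]


for m in (0.5, 2.5, 3.0):
    probs = [round(poisson_pmf(k, m), 4) for k in range(6)]
    print(m, poisson_modes(m), probs)
```

For a mean of 3 the probabilities at 2 and 3 are equal, which is the bimodal case described above.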

In the Poisson distribution the mean (which 
is equal to the variance) performs a function 
similar to that performed by the mean and the 
standard deviation together in the normal dis- 
tribution, for knowledge of the mean enables 
one to determine the distribution and all its 
moments. 

The Poisson distribution is rarely employed 
in educational research. The ever increasing 
use of statistical techniques in educational 
research makes it important that other than 
normal distributions be employed where neces- 
sary. Not all non-normal distributions will 
follow the Poisson law, but, when the assump- 
tions underlying the Poisson distribution appear 
to be met, the hypothesis that the data may be 
distributed in such a manner should be tested. 
More accurate analysis and consequently more 
reliable statistical inferences are possible 
when the true nature of a distribution is recog- 
nized, whatever its form may be. 

Data may be expected to fall into the form of 
the Poisson distribution only when certain basic 
assumptions are met. These assumptions are 
five in number. 


1. Discreteness. The data involved must 
be discrete rather than continuous. The event, 
the occurrence of which constitutes the data, 
must be of the type which either occurs or does 
not occur. It is an ‘‘all or none’’ proposition. 
Scores obtained from measurements yielding 
fractional or negative values cannot, therefore, 
under any circumstances, be said to constitute 
data subject to the Poisson law. 

2. Possibility of large number of occurrences 
of the event. The possible number of occurrences 
of the event at any particular opportunity 
must theoretically be infinite. In practice, 
of course, such data are not to be expected. It is 
essential, however, that in each specified 
division of time, space or other unit there exist 
a possibility of a large number of occurrences 
of the event whose frequency constitutes the 
variable making up the distribution in question. 
This requirement must not be confused with a 
statement concerning the size of the sample 
under consideration. The size of the sample 
could not influence the nature of the population 
from which the sample is drawn. If, however, 
the sample is small the form of the population 
distribution cannot be determined with 
much confidence. 

When the possible number of occurrences is 
small, a distribution is often observed to have 
frequencies conforming closely to the theoretical 
frequencies for zero, one and two. The hypothe- 
sis that the Poisson distribution is present must 
be rejected in such cases, as it must in all 
cases where the basic assumptions are not met. 

3. Small likelihood of the event occurring at 
any particular opportunity. The hypothesis that 
a Poisson distribution is present cannot be ac- 
cepted unless the probability that the event will 
occur at any one given opportunity is very small. 
The requirement that there be a large number of 
possible occurrences must be distinguished from 
the present limitation. It must be borne in mind 
that in order to establish the presence of a Pois- 
son distribution both of these requirements must 
be met and that compliance with only one of them 
is insufficient. 

4. Equality of opportunity for occurrence. 
The specified divisions of time, space, or other 
units which are so constituted that the number of 
occurrences of an event or the frequency of 
individuals can be counted for each division 
present ‘‘areas of opportunity for occurrences.’’ 
Each such division or area must present an 
equal opportunity for the occurrence of the event 
being observed. It would not be possible, for 
example, to use as such an area or division the 
schools of a city or state when, because of 
varying qualities as to size or excellence, such 
schools present varying probabilities of an event 
occurring or not occurring in a division. In other 
words, where the probability of the occurrence 
of an event varies from division to division, 
whether such division be one of time, place, or 
other unit, the Poisson distribution may not be 
used as a model for the empirical distribution. 
Sometimes a compound Poisson distribution 
may occur in such cases. 
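
The effect of unequal opportunity can be illustrated by pooling counts from divisions with different means. The sketch below is an illustration under assumed conditions, not a result from the text: half of the divisions are taken to follow a Poisson law with mean 1 and half with mean 5 (hypothetical values). The pooled counts then have a variance well above their mean, so they no longer behave as a single Poisson distribution.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variate with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def mixture_mean_var(rates, kmax=60):
    """Mean and variance of counts pooled from equally many divisions
    having each of the given Poisson means (sum truncated at kmax)."""
    pmf = [sum(poisson_pmf(k, r) for r in rates) / len(rates)
           for k in range(kmax)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


equal_mean, equal_var = mixture_mean_var((3.0, 3.0))      # equal opportunity
unequal_mean, unequal_var = mixture_mean_var((1.0, 5.0))  # unequal opportunity
print(f"equal:   mean {equal_mean:.2f}, variance {equal_var:.2f}")
print(f"unequal: mean {unequal_mean:.2f}, variance {unequal_var:.2f}")
```

Both cases have the same mean, but only the equal-opportunity case keeps the variance equal to the mean.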

5. Independence. The occurrences of the 
event must be independent, that is to say, a. the 
occurrences of an event at one opportunity must 
have no effect on the probability of a second oc- 
currence of the event at the same opportunity, 
and b. the occurrence of the event at one oppor- 
tunity must have no effect on the probability of 
an occurrence of the event at another opportun- 
ity. 

The assumption of independence is of great 
importance in determining the propriety of 
accepting a hypothesis that the Poisson 
distribution is present. It is essential that the 
occurrence of an event have no effect on the 
probability of a further occurrence of that event 
at that same time or place or at another time 
or place. Many educational data seem to conform 
to the Poisson distribution in all other respects 
such as equality of mean and variance, 
correspondence of observed and theoretical 
Poisson frequencies and the meeting of all other 
assumptions, but fail to meet this requirement. 

There are, for example, several situations 
having to do with new processes in education in 
which at first glance the distributions of the fre- 
quency of adoption of such new processes appear 
to follow the Poisson law. Careful analysis, 
however, reveals that the adoption of such a 
process in one situation may well tend to influ- 
ence its adoption in another and that, therefore, 
the requirement of independence is not fulfilled. 
Under such circumstances accepting the hypo- 
thesis that the Poisson distribution may be re- 
garded as a model would be erroneous. 

In general, the Poisson distribution will be 
found only in cases where the distribution is 
highly skewed, but this is not necessarily 
always so. As already noted, the Poisson 
distribution tends toward symmetry when the mean 
is large. 

In a Poisson distribution the mean and vari- 
ance are equal. While the near equality of mean 
and variance certainly is a rough indication that 
the possibility of the presence of a Poisson dis- 
tribution should be investigated, it is not a suf- 
ficient basis for assuming the presence of such 
a distribution. Distributions with equal or near- 
ly equal mean and variance may fail to meet the 
assumptions underlying the Poisson distribution. 
In such cases the hypothesis that this distribution 
is followed by the data must be rejected 
even when the observed frequencies equal or ap- 
proach the tabulated frequencies of the Poisson 
distribution. 

It is important to know that an empirical 
distribution of educational data follows the Pois- 
son law because with such knowledge a better 
analysis and understanding of the data are pos- 
sible. First, it is possible to estimate the 
likelihood of an event occurring or not occurring 











March, 1955) 


with greater accuracy because of the availability 
of tables of the Poisson distribution. Second, it 
is vital to the use of various statistical tests to 
know the nature of the distribution involved 
because such tests are valid only when certain 
basic assumptions underlying them have been met. 
When the basic assumptions for such a test are 
not known to be satisfied by data distributed in a 
Poisson distribution the use of such a test 
cannot be justified. Third, it is important in 
performing an analysis of variance, as a 
transformation is known which makes possible the 
application of this procedure to data distributed in 
accordance with the Poisson law. It is here that 
the educational research worker is particularly 
concerned because he so frequently uses analysis 
of variance procedures. Often such an analysis 
is performed without any examination of the 
presence or absence of equality of variability in the 
distribution of the data being examined. Since 
such equality of variability is a basic assumption 
underlying this technique, its absence is fatal to 
the correctness and validity of the results 
obtained. When a difference is claimed to be 
significant or non-significant on the basis of a very 
small departure from a given F-value, such a 
failure to obtain precise results may invalidate 
an important conclusion. 
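
The text does not name the transformation. The sketch below assumes that the square-root transformation is meant, since that is the one commonly applied to Poisson counts before an analysis of variance. It computes, by direct summation, the variance of X and of the square root of X for several assumed means; the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k), computed in log space to avoid overflow for large k."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def variance_of(transform, mean, kmax=200):
    """Variance of transform(X) for Poisson X, summing over k = 0..kmax-1."""
    probs = [poisson_pmf(k, mean) for k in range(kmax)]
    first = sum(p * transform(k) for k, p in enumerate(probs))
    second = sum(p * transform(k) ** 2 for k, p in enumerate(probs))
    return second - first ** 2


for mean in (2.0, 5.0, 10.0, 20.0):
    raw = variance_of(float, mean)
    rooted = variance_of(math.sqrt, mean)
    print(f"mean {mean:5.1f}: var(X) = {raw:6.2f}, var(sqrt X) = {rooted:.3f}")
```

The variance of the raw counts grows with the mean, whereas the variance of the transformed counts stays close to one quarter once the mean is moderately large, which is what the assumption of equal variability requires.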

The number of instances in which the Pois- 
son distribution has been found to be applicable 
in research fields other than education is very 
large. Only a few examples can be given here. 

Research workers in the field of accident 
proneness have found the Poisson distribution to 
be fundamental to their work (3). 

Many situations occurring in both pedestrian 
and vehicular traffic have been found to fit the 
Poisson form (4). This fact has been utilized in 
tabulating the most efficient timing of traffic 
lights as well as the most efficient placing of high- 
way stop signs. 

Many natural phenomena occur in such a random 
manner as to meet all the assumptions underlying 
the Poisson distribution. A few of them are 
radio-active disintegration (5), chromosome 
breakages produced in cells by X-rays (6), the 
number of nuclei of condensation in the air (7), 
and the number of meteoric falls per small 
geographic region (8). 

Quality control tables for sequential samp- 
ling purposes are based on this law (9). 

The new field of operations research has been 
developed in large part on the basis of the Pois- 
son law (10). 

Telephonic engineers use the Poisson distri- 
bution in determining the number of cables, 
switchboard stations, and so forth needed to 
serve a given amount of traffic (11). 

In concluding this abbreviated list of 
non-educational situations involving this 
distribution, a few odd instances will be given. 

Singh (12) found that the daily number of 
ticketless passengers traveling on an Indian railway 
was distributed normally but that the number 
who had to be evicted for refusal to pay their 
fare fell into the form of a Poisson distribution. 

The classic example is the one given by 
Bortkewitz (13) who found that the number of men per 
year per Prussian army corps who were killed 
by the kick of a horse fell into this form. 
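
The comparison of observed with theoretical frequencies can be sketched with this example. The frequencies below are those usually quoted for the horse-kick tabulation (200 corps-years); they are not reproduced in this article and are entered here on that assumption.

```python
import math

# Deaths per corps per year -> number of corps-years, as usually quoted
# for the Prussian horse-kick data (assumed; not given in this article).
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

n = sum(observed.values())
mean = sum(k * f for k, f in observed.items()) / n
variance = sum(f * (k - mean) ** 2 for k, f in observed.items()) / (n - 1)

# Theoretical Poisson frequencies with the same mean
expected = {k: n * math.exp(-mean) * mean**k / math.factorial(k)
            for k in observed}

print(f"n = {n}, mean = {mean:.3f}, variance = {variance:.3f}")
for k in observed:
    print(k, observed[k], round(expected[k], 1))
```

With these figures the mean and variance are nearly equal and the theoretical frequencies lie close to the observed ones, which is the pattern the text describes.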

In conclusion one example of educational data 
distributed in accordance with the Poisson law 
will be discussed. 

When a rote task of the sort in which perfect 
mastery is the goal is first introduced to learn- 
ers, the probability of errors being made by 
these learners is very large. This probability 
becomes smaller as the stage of perfect mast- 
ery of the rote task is approached. Of course, 
the possibility of making errors remains equally 
large at all stages of the learning process. It is 
also reasonable to conclude that at the beginning 
of the instruction, the commission of one error 
increases the probability of another error. In 
typewriting, for example, a beginner hitting 
a wrong key is likely to repeat that error or to 
make other errors based on his erroneous men- 
tal picture of the keyboard. An experienced 
student is not so likely to make cumulative er- 
rors. This is another way of saying that the 
errors made by advanced learners are more 
nearly independent than the errors made by be- 
ginners. In this type of learning, the possible 
number of errors in one task is the same as that 
in another. It, therefore, appears that a distri- 
bution of errors made by learners, as the stage 
of perfect mastery is approached, meets all the 
requirements imposed by the basic assumptions 
underlying the Poisson distribution. If this is so, 
some interesting implications arise concerning 
a method of measuring the state of learning at- 
tained by a group being instructed in the type of 
task here discussed. When the distribution of 
errors takes the form of a Poisson distribution 
it seems reasonable to suggest the hypothesis 
that the point has been reached in the learning 
process where further instruction would yield 
minimal further gains in learning. A further 
hypothesis that such a distribution would depart 
from the Poisson form when instruction and prac- 
tice cease and forgetting sets in, would seem 
worthy of investigation. The hypothesis that 
further practice would then again bring such a 
distribution into the shape of the Poisson 
distribution would follow logically if the first two 
hypotheses were found to be tenable. 

The relationship between the distribution of 
errors and the Poisson distribution was first 
pointed out by Shuster (14) in his study on the 
teaching of slide rule procedures. Shuster found 
that the numbers of errors made by his students 
at the end of the experiment were distributed in 
accordance with the Poisson law. 

With this study as a cue a study was made of 
the errors made by students of typewriting (15). 

The subjects were students taking first, second, 
and third semester courses in typewriting in 
the Commerce Department of a large midwestern 
college. In each of these courses a bi-weekly test 
of constant length was given. The data covered a 
period of three years and included the results of 
nearly 9000 of these bi-weekly tests. An 
examination of the average number of errors made per 
student per test showed that there was little 
change in this respect (the mean varied from a 
high of 8.98 in the first semester to a low of 6.29 
in the last test of the third semester). This fact 
is easily accounted for by the increased rates of 
speed on the part of the more advanced students. 
There was, however, a considerable change in 
the variance which dropped from 36.47 in the first 
test of the first semester to 7.15 in the last test 
of the third semester. 

The distribution of errors in the third sem- 
ester showed a clear correspondence to the Pois- 
son distribution and the chi-square test bore out 
this correspondence. 
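
The change reported above can be expressed as the ratio of variance to mean, which equals one for a Poisson distribution. The sketch below pairs the highest reported mean with the first-test variance; that pairing is an assumption, since the text gives the mean of 8.98 only as a first-semester figure. The function name is hypothetical.

```python
def dispersion_index(mean, variance):
    """Ratio of variance to mean; equal to 1 for a Poisson distribution."""
    return variance / mean


# Figures reported in the text (the pairing of the early values is assumed)
early = dispersion_index(8.98, 36.47)   # first semester
late = dispersion_index(6.29, 7.15)     # last test of the third semester
print(f"early ratio = {early:.2f}, late ratio = {late:.2f}")
```

On this reading the early ratio is about four, while the late ratio is only a little above one.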

A specific examination of the assumptions 
heretofore given as applied to this situation bears 
out the reasonableness of this correspondence be- 
tween the empirical distribution and the theoret- 
ical Poisson distribution. 

1. Discreteness. The distributions are 
discrete as an error is either made or it is not. 
There is no other possibility. Negative or frac- 
tional values are impossible in the situation con- 
sidered here. 

2. Possibility of a large number of occurrences 
of the event. There are a very large number 
of possibilities for errors to be made in each 
test and this requirement is, therefore, rather 
obviously met. 

3. Small likelihood of the event occurring at 
any particular opportunity. As perfect mastery 
is approached by the student, the probability of an 
error at any particular single opportunity becomes 
constantly smaller. It would, therefore, appear 
that the requirement that the probability of an 
event at any particular instance should be small 
will be fulfilled by distributions of errors made 
by learners who are approaching the final stages 
of learning. 

4. Equality of opportunity for occurrence. 
Very little question could arise to cast doubt on 
the theory that each of the many items in the test 
presents an equal opportunity for the occurrence 
of the event we are concerned with. 

5. Independence. As has already been noted, 
the kinds of errors made at the later stages of 
learning of a rote process are not dependent on 
other errors as errors at the early stages may 
be. The commission of an error by an advanced 
student does not make appreciably more likely the 
commission of another error by him or by any 
other student. The requirement of complete 
independence of the events with the frequency of 
which the distributions are concerned is, 
therefore, substantially complied with at later stages 
of instruction. 

In the light of this examination of the manner 
in which these distributions meet the basic as- 
sumptions underlying the Poisson distribution it 
would appear reasonable that the frequency of the 
errors made by advanced students would fall in- 
to the form of the Poisson distribution while the 
earlier distributions do not. 

A similar study is reported by Davis (16) in 
which he found that data consisting of the number 
of typing strokes between errors fell into this 
form. 

Many other examples of Poisson distributions 
in the field of educational research could be given 
if space permitted. 

While preparing this paper the writer came 
across the following sentence in an article 
purporting to report an experiment in the teaching 
of reading: ‘‘With no emphasis upon the formal 
teaching reading whatever, the experimental 
groups outgained the comparison groups in reading 
achievement by an almost double score (9.5 
months to 5.8 months) at a statistical 
significance of less than 1%.’’ It is hoped that this 
discussion of the Poisson distribution may 
contribute to reports of educational research having a 
statistical significance of more than 1%. 


FORECASTING FUTURE ENROLLMENT BY 
CURVE-FITTING TECHNIQUES 


H. P. KUANG* 
University of Minnesota 


WE KNOW relatively little in advance 
about the future, yet we are frequently making 
plans for the future. Every university president 
or every college dean is constantly interested in 
knowing how many dollars, buildings, faculty 
members and course offerings would be required 
to provide instruction for the students 
who would be in attendance in one or in a series 
of years in the future. This requires in a high 
degree the forecast of future college enrollment. 


There are a number of approaches which 
have been made to the problem of forecasting 
school enrollment. The approach as discussed 
here involves a curve-fitting technique. The 
curve-fitting procedure consists in determining 
a functional relationship which exists between 
years and the past enrollment records and then 
projecting this function to the year or years for 
which prediction is desired. It should be noted 
that this approach is made on the fundamental 
assumption that the historical records provide 
a useful clue to the possible future enrollments. 
In other words, it is assumed that the present 
enrollment growth is fairly well indicated by 
past records, and that factors operating in the 
recent past will continue their influences in the 
future. 

The purpose of this paper is to suggest three 
types of curves by means of which the predic- 
tion of future enrollment might be made. It is 
not my attempt to review all growth curves, but 
merely to give a little suggestion on the prob- 
lem of estimating uncertain enrollments. 


A. The Exponential Curve 





The exponential function is applicable to the 
growth of enrollment. We shall first define the 
function as follows: 

If the values of t (say, year) are in arithmetic 
progression and the corresponding values of 
y (say, enrollment) are in geometric progression, 
then the relation between the variables t 
and y is said to be an exponential function, 
$y = ab^t$. 

Let $t_1,\ t_1 + \Delta t,\ t_1 + 2\Delta t,\ \ldots,\ t_1 + (n-1)\Delta t$ 
be an arithmetic series and let 
$y_1,\ ry_1,\ r^2y_1,\ \ldots,\ r^{n-1}y_1$ be a geometric series. 
Then the nth term of the arithmetic series is 
$t_n = t_1 + (n-1)\Delta t$, or $n - 1 = (t_n - t_1)/\Delta t$. 
Similarly, the nth term of the geometric series is 
$y_n = r^{n-1}y_1$. Hence 

\[ y_n = y_1 r^{(t_n - t_1)/\Delta t} = y_1 r^{-t_1/\Delta t}\,\bigl(r^{1/\Delta t}\bigr)^{t_n} \] 

or 

\[ y_n = ab^{t_n} \qquad (1) \] 

where $a = y_1 r^{-t_1/\Delta t}$ and $b = r^{1/\Delta t}$. 

Since any t is connected with the corresponding 
y the equation (1) becomes 

\[ y = ab^t \qquad (2) \] 


The exponential curve may be fitted readily 
by reducing it to logarithmic form. Taking log- 
arithms of the above expression (2), the equa- 
tion becomes 


log y = log a + t log b (3) 


which is a straight-line trend of log values. 

According to the least-square principle we 
may minimize the sum of squared residuals by 
differentiating in turn with respect to log a and 
log b. We then obtain the solution of the 
logarithmic exponential equation (3): 

\[ \log a = \frac{\sum t^2 \sum \log y - \sum t \sum t \log y}{n\sum t^2 - (\sum t)^2} \] 

\[ \log b = \frac{n\sum t \log y - \sum t \sum \log y}{n\sum t^2 - (\sum t)^2} \] 





In case the data are not given with the values 
of the independent variable in arithmetic pro- 
gression it may be readily proved that if the ra- 
tio of the first difference of log y to the first 
difference of t is constant, then the relation be- 
tween t and y can still be expressed by the same 
exponential function (2). 
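
The fitting procedure for the exponential curve can be sketched as follows. The enrollment figures are hypothetical (generated from a = 1000, b = 1.05) and the function name is illustrative; the least-squares expressions are those for the straight-line trend of log values in equation (3).

```python
import math


def fit_exponential(t, y):
    """Least-squares fit of log y = log a + t log b; returns (a, b)."""
    n = len(t)
    logs = [math.log10(v) for v in y]
    sum_t = sum(t)
    sum_t2 = sum(x * x for x in t)
    sum_log = sum(logs)
    sum_t_log = sum(x * l for x, l in zip(t, logs))
    denom = n * sum_t2 - sum_t ** 2
    log_a = (sum_t2 * sum_log - sum_t * sum_t_log) / denom
    log_b = (n * sum_t_log - sum_t * sum_log) / denom
    return 10 ** log_a, 10 ** log_b


# Hypothetical enrollments growing 5 per cent a year from 1000
years = [0, 1, 2, 3, 4]
enrollment = [1000 * 1.05 ** x for x in years]

a, b = fit_exponential(years, enrollment)
forecast = a * b ** 6   # projection to year 6
print(f"a = {a:.1f}, b = {b:.4f}, forecast for t = 6: {forecast:.0f}")
```

Because the hypothetical data lie exactly on an exponential curve, the fit recovers the generating constants; with real enrollment records the residuals of the log values would show how well the exponential form applies.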


*The author is grateful to Professor Palmer O. Johnson, who read critically the draft of this paper. 
Thanks are also due to Dean R. E. Summers through whose courtesy the author was permitted to spend a 
short period of time examining the trend of the enrollment data in his office. 


B. The Gompertz Curve 





This curve was first reported in 1825 in a 
paper read by Benjamin Gompertz, a Jewish 
statistician who worked with a life insurance 
company in England. The general formula for 
the curve is 

\[ y = d\,g^{q^t} \qquad (4) \] 


in which y represents measure of growth at the 
time t; d, g and q being constants. Essentially 
the Gompertz curve is of the exponential type. 
The exponential function as discussed above is 
a linear trend of log values while the Gompertz 
curve is a linear trend of log log values. To 
illustrate this relationship we simply put d = 1 
and take the logarithm of both sides of the 
equation (4): 

\[ \log y = q^t \log g \] 

Taking the logarithm again, we have 

\[ \log\log y = t\log q + \log\log g \] 

which is the equation of a straight line in terms 
of log log values. 

Hence, it is evident that the Gompertz function 
is an expression of the nature of enrollment 
growth if the relationship between time and the 
log log values of the enrollment is linear. In 
order to fit the enrollment data to the Gompertz 
curve, we have to determine the constants d, g 
and q. The three constants may be estimated 
from three equally spaced points which are so 
chosen as to represent the characteristic of the 
data. Let the three points be $(t_0, y_0)$, 
$(t_0 + 1, y_1)$, $(t_0 + 2, y_2)$. Substituting in the 
equation (4), we have the following three equations: 

\[ y_0 = d\,g^{q^{t_0}} \] 

\[ y_1 = d\,g^{q^{t_0+1}} \] 

\[ y_2 = d\,g^{q^{t_0+2}} \] 

To find values for d, g and q in terms of $y_0$, 
$y_1$ and $y_2$ we take the logarithm of both sides of 
each of the three equations and solve them. The 
values of d, g and q for fitting the Gompertz 
curve are as follows: 

\[ \log d = \log y_0 - \frac{\log y_1 - \log y_0}{q - 1} \] 

\[ \log g = \frac{\log y_1 - \log y_0}{q^{t_0}(q - 1)} \] 

\[ q = \frac{\log y_2 - \log y_1}{\log y_1 - \log y_0} \] 




C. The Logistic Curve 





In 1838 Professor P. F. Verhulst, a Belgian 
mathematician, published a memoir suggesting 
the use of a curve, which he called the 
logistic, to describe the growth of population. 
His work was forgotten for seventy years, but 
since the publication of Du Pasquier’s paper in 
1918, the logistic curve has been widely used in 
the study of population growth. Since there is 
a marked relation between the past growth of 
population and the growth of enrollment, it is 
not novel to apply the logistic curve in the 
fitting of enrollment trend. 

The general logistic function may be given by

$$y = \frac{L}{1 + m e^{\phi(t)}} \tag{5}$$

where $\phi(t) = S_1 t + S_2 t^2 + S_3 t^3 + \dots + S_n t^n$;
y denotes the growing variable (enrollment), t denotes time
in years, and m, L and the S's are constants.
Several methods have been developed for fitting this curve.
The first of these, which may be called the method of the
rate of increase, is based upon the fundamental logistic
property that the percentage rate of increase is a linear
function of the growing variable, say enrollment. We may
calculate the percentage increases over successive intervals
and plot these against the enrollments at the end of the
intervals, then fit a line to these points, say by least
squares. The application of this procedure is given by H.
Hotelling. Another method, which may be called the method of
selected points, is used by Verhulst and by Pearl and Reed.
This method consists essentially of a preliminary estimate
of the constants or parameters from a set of points. The
Pearl-Reed procedure for estimating the constants is limited
to five parameters. It should be pointed out, however, that
any number of parameters may be determined by a series of
points. Suppose we wish to determine n+2 parameters by use
of the n+2 points (0, y_0), (t_1, y_1), (t_2, y_2), ...,
(t_n, y_n), (t_{n+1}, y_{n+1}); we have from (5):


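$$m e^{\phi(t)} = \frac{L - y}{y}$$

$$\phi(t) = \log (L - y) - \log m - \log y = \log \frac{L - y}{y} - \log m$$

Since $\phi(t) = S_1 t + S_2 t^2 + \dots + S_n t^n$, let
$\log m = S_0$. Then

$$S_0 + S_1 t + S_2 t^2 + \dots + S_n t^n = \log \frac{L - y}{y}$$

Now if the curve passes through the selected n+2 points, we
can determine the n+2 parameters S_0, S_1, ..., S_n and L by
the following equations:

$$S_0 - \log \frac{L - y_0}{y_0} = 0$$

$$S_0 + t_1 S_1 + t_1^2 S_2 + \dots + t_1^n S_n - \log \frac{L - y_1}{y_1} = 0$$

$$\vdots$$

$$S_0 + t_{n+1} S_1 + t_{n+1}^2 S_2 + \dots + t_{n+1}^n S_n - \log \frac{L - y_{n+1}}{y_{n+1}} = 0$$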


The values of these parameters may be expressed in terms of
determinants or can be obtained by any of the algebraic
methods for solving simultaneous equations, such as the
Doolittle method. The solution of these equations can be
treated as a first approximation to the best fitting
logistic curve. If a better fit is desired than that given
by these selected points, we may obtain a second
approximation by minimizing the sum of squares of the
residuals; and the process may be repeated a number of
times. This process does not, however, minimize the sum of
the squares of the differences between the observed and
expected values. Hence, strictly speaking, the procedure
which Pearl and Reed call the method of least squares is not
a true least squares procedure in the usual sense. The
details of the least squares fitting are given by Henry
Schultz (ref. 4).
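
The method of the rate of increase described earlier in this section can be sketched as follows. The sketch rests on assumptions not made in the text: it treats only the simple logistic in which φ(t) reduces to S_1 t, it takes the observations at unit intervals starting from t = 0, and it fits the line by ordinary least squares. The function names and the figures are hypothetical. For such a curve the relative increase over an interval is exactly linear in the enrollment at the end of the interval, with intercept c = e^{-S_1} - 1 and slope -c/L, which is what the code inverts.

```python
import math


def logistic(t, L, m, s):
    """Simple logistic y = L / (1 + m * exp(s * t)); s plays the role of S_1."""
    return L / (1.0 + m * math.exp(s * t))


def fit_logistic_rate_of_increase(y):
    """Estimate (L, m, s) from enrollments y[0], y[1], ... at t = 0, 1, 2, ...

    Regresses the relative increase over each interval on the enrollment
    at the END of the interval, then converts intercept and slope to L and s.
    """
    rates = [(y[i + 1] - y[i]) / y[i] for i in range(len(y) - 1)]
    ends = y[1:]
    n = len(rates)
    mean_x = sum(ends) / n
    mean_r = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in ends)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(ends, rates))
    slope = sxr / sxx
    intercept = mean_r - slope * mean_x
    L = -intercept / slope            # upper limit of growth
    s = -math.log(1.0 + intercept)    # coefficient of t in the exponent
    m = (L - y[0]) / y[0]             # from the observation at t = 0
    return L, m, s


# Made-up series generated from a known curve, to show the constants are recovered.
series = [logistic(t, 1000.0, 9.0, -0.5) for t in range(11)]
print(fit_logistic_rate_of_increase(series))
```

With real enrollment figures the points will scatter about the line, and the estimate of L in particular can be unstable when the series has not yet begun to level off.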

A good many statisticians and mathematicians 
in a good many lands have attempted to devise 
a curve for the growth phenomenon. Mitscher- 
lich of Germany, Robertson of Australia, Yule 
of England have all used a type of curve to de- 
fine growth. There is no demonstration of the 
applicability of these curves, so far as I am 
aware, to the enrollment growth. 

In projecting future enrollment it is important to observe
whether or not the curve fitted is the curve which ought to
be fitted. A test of goodness of fit is desirable in order
to see whether the enrollment trend for a given period is a
representative one. The task of predicting future enrollment
for a long period becomes increasingly difficult because of
the many independent variables, not known at present, which
may affect it in the future. In such a case, the predictions
become cumulatively less reliable as the forecast extends
farther into the future. Hence, even when an enrollment
growth curve has been successfully fitted to a long series
of data, it is preferable to compute the estimated
enrollment for only a short period of years as a tentative
forecast. Such estimates may prove to be close enough for
most practical purposes, but special circumstances may
modify their accuracy.
It appears to me that the problem of forecasting in
statistics is pretty much the same as the problem of being
in philosophy. All of the famous philosophers failed to
solve the problem of being: Plato tried to solve it and
failed; Aristotle tried and failed. In the field of the
sciences, many statisticians and mathematicians have tried
to solve the problem of the future and failed. For instance,
Professor Verhulst himself failed in fitting the population
of Russia from 1796 to 1827 to a curve and abandoned the
investigation of the curve of population. More recently Dr.
T. C. Hungate of Teachers College, Columbia University,
failed to prophesy the total college enrollment of 1950 in
his book, Financing the Future of Higher Education, which
was published in 1946. Professor R. S. Vaile of the
University of Minnesota failed to foretell the 1948
enrollment in his article, "Enrollment After the War,"
published in 1944. The failure of forecasts from a curve to
agree with the actual enrollment may be due either to the
choice of the wrong type of curve or to our inability to
control all factors in operation. Statisticians have lost
their reputations by forecasting an uncertain event, such as
future college enrollment. Whenever this happens, there has
often been a tendency for people to draw the hasty
conclusion that statistics is no good. In view of this
widespread criticism of statistics, it is deemed advisable
to point out that there is no necessary relation between the
goodness of fit of a mathematical curve and its reliability
for forecasting. A curve may fit the enrollment data for the
past fifty years with a high degree of accuracy, and yet
fail to predict the enrollment for the next year or two.

The forecasting of future enrollment is not simply a
statistical problem and therefore perhaps never can be
settled by statistics alone. The statistical estimate gained
from the study of past enrollment records must be
supplemented by other knowledge which may be quite
non-statistical in nature. The factors actually represented
by the enrollment trend are very complex and difficult to
analyze. In fact, future college enrollment involves the
operation of a rather large number of factors. For example,
the number of registered students in a given college may be
large or small depending upon changes in the international
situation and in economic conditions, veteran enrollments,
birth rates, high school enrollments, unusual migration,
provisions for educational benefits, policies concerning
student deferment and many other factors. Some of these
factors may operate continuously for a number of years and
some may drop out temporarily or permanently, while some
unknown new ones may make their appearance in the future. It
appears, therefore, impossible for statisticians to control
all new factors unknown to them, and consequently every
statistician may at times err in his estimates of
probabilities.

In a certain sense an enrollment trend may be considered as
an established habit; it is likely to continue, but other
factors may change it. The element of personal judgment
based upon an understanding of the factors affecting the
enrollment is always invaluable. Each college should study
most realistically its enrollment outlook in the light of
past, present and future conditions which are likely to
affect its enrollment record. If and when some estimate of
the future enrollment must be made and no other device can
meet the need as effectively as the fitting of a trend, then
projecting future enrollment under normal growth conditions
by the curve-fitting procedures suggested in this paper may
be exceedingly useful for practical purposes. Finally, it
should be realized that the object of forecasting enrollment
is not to determine a curve that will tell exactly what must
happen in future years, but to make a careful analysis,
based on an appropriate statistical technique, which will
enable the administrator to take future enrollments into
account to a greater extent than he could without it.
AN APPLICATION OF MULTIPLE REGRESSION
ANALYSIS IN DETERMINING THE RELATIVE
CONTRIBUTION OF CERTAIN COMPONENTS
OF READING ABILITY IN GRADE AND
HIGH SCHOOL ACHIEVEMENT


LAVERN L. KRANTZ 
Ohio University 
Athens, Ohio 


THE PURPOSE of the present article is 
to show how multiple regression analysis can 
be applied to reveal the relationships between 
reading and basic skills as measured in the 
seventh grade (the independent variables) and 
success in certain phases of the academic pro- 
gram at the high school level (the dependent var- 
iables). For complete results of this investiga- 
tion the reader is referred to a doctoral disser- 
tation completed in 1954.1 The main purpose 
of the study was to determine the significant 
partial regression coefficients and determine 
their share of the accountable variance. 

The study was designed to find the relationships among
measured areas of seventh-grade reading abilities and study
skills and the content areas of the high school. The
students of the Austin Public Schools constitute the
population of this study. It is to this population that the
conclusions apply. The samples consisted of the pupils from
two seventh-grade classes. Data were obtained from the
permanent records of the pupils. The first seventh-grade
class, tested in 1947, was tested again in 1952 when in the
eleventh grade. The second seventh-grade class was tested in
1949 and again in 1952 when in the ninth grade. The sample
of 471 pupils included in the study consisted of 215 pupils
in the first group (7-11) and 256 pupils in the second group
(7-9).

Tests used in the study were the California 
Intelligence Tests, (non-language section), 1946 
Revision Grades 9 to Adult; the Iowa Tests of 
Basic Skills, Form Z and Form R; and the Iowa 
Tests of Educational Development, Form Y-2. 

The reading abilities and work study skills measured by the
Iowa Tests of Basic Skills in the seventh grade (independent
variables), with their notation, were as follows:

    X1    Reading Comprehension
    X2    Reading Vocabulary
          Map Reading
          Use of References
          Use of Index
          Use of Dictionary
          Reading Graphs, Tables, and Charts
    T1    Total Work Study Skills
          Usage (language)
          Spelling
    T2    Total Language Skills
    X10   Fundamental Knowledge in Arithmetic
    X11   Fundamental Operations
    X12   Problems
    T3    Total Arithmetic

The measures in content areas at the high school level, as
obtained from scores on the Iowa Tests of Educational
Development (dependent variables), were as follows:

    Y1    Understanding of Basic Social Concepts
    Y2    Background in the Natural Sciences
    Y3    Correctness and Appropriateness of Expression
    Y4    Ability to do Quantitative Thinking
    Y5    Ability to Interpret Reading Materials in the
          Social Studies
    Y6    Ability to Interpret Reading Materials in the
          Natural Sciences
    Y7    Ability to Interpret Literary Materials
    Y8    General Vocabulary
    Y9    Total of Above
    Y10   Use of Sources of Information

The assumptions underlying the validity of the use of
multiple regression analysis are that the dependent variable
is normally distributed and that the regression of the
dependent variable on the linear combination of independent
variates is linear. These assumptions were tested
empirically by plotting the dependent variables on normal
probability paper, normality being indicated by a
straight-line relationship.

1. LaVern L. Krantz. The Relationship of Reading Abilities
and Basic Skills of the Elementary School to Success in the
Interpretation of the Content [...] (doctoral dissertation,
University of Minnesota, 1954).

In order to handle the large amount of data efficiently and
accurately, they were transferred to Hollerith cards, which
permitted the machine calculation of the squares, sums of
squares, and cross products essential to the multiple
regression analysis.

The first step in the analysis of the data was to determine
the zero-order correlations and test their significance at
the one percent level of probability. The null hypothesis
that the correlation coefficients are not significantly
different from zero was tested.
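
The article does not say which test was applied to the zero-order correlations. One common choice, shown here purely as an illustrative sketch, is the t statistic for a correlation coefficient with n - 2 degrees of freedom. The function names are hypothetical, and the critical value is a large-sample normal approximation, which is an assumption that is reasonable only for samples about as large as those reported (215 and 256 pupils).

```python
import math
from statistics import NormalDist


def correlation_t(r, n):
    """t statistic for testing H0: rho = 0, given sample correlation r and size n."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def is_significant(r, n, alpha=0.01):
    """Two-sided test at level alpha.

    Assumption: with large n the t distribution is close to normal, so the
    critical value is taken from the standard normal distribution.
    """
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(correlation_t(r, n)) > critical


# Hypothetical correlations for a group of 215 pupils.
print(is_significant(0.30, 215))
print(is_significant(0.10, 215))
```

For small samples the exact t critical value should be used in place of the normal approximation.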

Two main areas were investigated in each grade using the
independent variates and dependent variables listed. The
first multiple regression analysis was set up using X1, X2,
X3, T1, T2 and T3, and their relationship with each of the
dependent variables Y1 to Y10 was determined. The second
analysis was made using X4, X5, X6, X7 and X8 in a similar
manner with the dependent variables. It is important to
point out at this time that no relationship between the two
analyses can be discussed, because each one is determined
from a separate matrix. The first analysis was used to
obtain total relationships and the second one to obtain
special relationships that may exist between each of the
work study skills and the dependent variables. The second
main area was a repetition of the first analysis for grade
eleven, conducted in a similar fashion.


An Example of the Application of Multiple Regression
Analysis in Calculating the Reading and Study Skills
Related to Y2











The regression equation is in terms of the population values
or parameters. The best estimates of these parameters are
obtained from the sample values. The partial regression
coefficients represent the weighted contribution that each
independent variate makes to the total estimate of the
dependent variable. The equation will predict the average
change in the dependent variable for a unit change in any
one of the independent variates, the others being held
constant. The predictive factors involved in making a best
estimate of future success in the problem under
investigation are of considerable interest to those
concerned with the teaching process.

The correlation between the criterion, or value we are
trying to predict, and the prediction obtained by using the
best weightings for the partial regression coefficients is
called the multiple correlation coefficient. The multiple
correlation coefficient gives an over-all evaluation of the
goodness of fit of the multiple regression model. It is an
indication of the degree of linear relationship between the
actual scores and the best predicted scores. A test of
statistical significance was applied to all multiple
correlation coefficients at the one percent level of
probability. If the coefficients were found not
significantly different from zero, the relationship could be
attributed to the operation of chance factors. The square of
the multiple correlation coefficient gives the proportion of
the total variance of the dependent variable mathematically
accounted for by the linear combination of the independent
variables. The direct contribution of each variate is
obtained from the square of the standard partial regression
coefficient for each variate. A major portion of the
interpretation of results is based on this fact.

The next logical step after finding the reading abilities
and skills apparently related to a particular high school
content area was to test the significance of each
relationship. When the above relationships cannot be
explained on the basis of sampling errors alone they are
said to be statistically significant and are, as a
consequence, retained in the regression equation. An
arbitrary probability level was established at the five
percent level for the retention of regression coefficients
used in determining the assignable variance.

The problem investigated required for its solution not only
the establishment of the existence of a relationship between
reading abilities and skills and the subject areas in high
school, but also the extent of that relationship. While two
or more abilities may be contributing to the total variance
of the dependent variable in high school, it must be
ascertained whether they contribute similar or different
amounts, and to what degree. All partial regression
coefficients stable at the five percent level, or the one
percent level of statistical significance as dictated by the
problem solution, were tested further to determine whether
there was a difference between them in the amount of
variance accounted for. If any differences can be accounted
for on the basis of random errors of sampling, it is not
possible to make any definite statements regarding the
unique amount of contribution made by each one. If the
difference between them cannot be accounted for by sampling
error, it is inferred that one may make a greater
contribution than the other.

The method of determining the amount of con- 
tribution of each variate to the total variance is 
shown in the sample analysis. The partial re- 
gression coefficients are changed to standard 
partial regression coefficients, squared, and 
the percent of assignable variance determined. 
The calculations have been completed in the ex- 
ample to illustrate the procedures used. 
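
The procedure just described can be sketched in Python as follows. This is a hedged illustration rather than the author's own computation: the function names are hypothetical, the conversion assumes that the two large figures in the worked example are the sums of squares of the independent variate and of the dependent variable, and the inputs are the Vocabulary coefficient and the squared standard coefficients as they are printed in the example that follows in the article.

```python
import math


def standard_partial(b, ss_x, ss_y):
    """Convert a partial regression coefficient b to standard form.

    Assumes ss_x and ss_y are the sums of squares (about the mean) of the
    independent variate and of the dependent variable respectively.
    """
    return b * math.sqrt(ss_x / ss_y)


def share_of_assignable_variance(squared_betas):
    """Express each squared standard coefficient as a proportion of their total."""
    total = sum(squared_betas.values())
    return {name: value / total for name, value in squared_betas.items()}


# Vocabulary coefficient and sums of squares as printed in the example.
beta_vocabulary = standard_partial(0.0522, 93114.3398, 7203.2773)
print(round(beta_vocabulary, 4))

# Squared standard coefficients as printed in the example.
squares = {
    "Work Study Skills": 0.1543,
    "Total Arithmetic Skills": 0.0385,
    "Vocabulary": 0.0352,
}
for name, share in share_of_assignable_variance(squares).items():
    print(name, round(share, 4))
```

Because the printed squares are already rounded, the proportions computed this way may differ from the published ones in the last decimal place.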

The steps commonly followed in solving a multiple regression
problem to reach a solution that is stable at the five
percent level of statistical significance for the regression
coefficients are not shown and may be found in other
sources.2 The computations below carry the solution on,
using the partial regression coefficients found to be
significant. In the example used it was found that the
partial regression coefficients for independent variables T1
(Work Study Skills), T3 (Total Arithmetic Skills), and X3
(Vocabulary) were significantly related to Y2 (Background in
Natural Science) and accounted for 51 percent of the
variance.

Our first requirement is that of converting the partial
regression coefficients to standard partial regression
coefficients.3 This is required in order to render the
measurements independent of scale. The process of
transformation is indicated below, where β_i (i = 1, 2, 3)
represents the respective standard partial regression
coefficients.

$$\beta_1 = .1301 \sqrt{\frac{85714.7343}{7203.2773}} = .3929$$

$$\beta_2 = .1071 \sqrt{\frac{24153.4960}{7203.2773}} = .1963$$

$$\beta_3 = .0522 \sqrt{\frac{93114.3398}{7203.2773}} = .1877$$

We next square each of the standard partial regression
coefficients and calculate the proportion that each square
constitutes of the total variance, i.e., R². (See table
below.) The change to a percentage basis is indicated in the
third column.

    Variate                     β²       Proportion   Percent
    Work Study Skills          .1543       .6767
    Total Arithmetic Skills    .0385       .1688
    Vocabulary                 .0352       .1543
    Total                      .2280       .9998


Summary and Conclusions


Multiple regression analysis was applied to determine the
relationships among seventh-grade reading abilities and
study skills and achievement in the content areas of the
high school.

After running the usual analysis and finding the partial
regression coefficients stable at the five percent level of
statistical significance, further treatment was applied to
determine the extent of the relationships. This was
accomplished by testing the significance of the difference
between partial regression coefficients to ascertain whether
a real difference in their respective contributions existed
or whether it could be accounted for by random errors of
sampling. An example was shown illustrating the method of
calculating the percentage of the total variance accounted
for by each component. The above procedures used in treating
the data were of basic importance in the interpretation of
the results.

2. Palmer O. Johnson. Statistical Methods in Research (New
York: Prentice-Hall, Inc., 1949), 377 pp.

3. Palmer O. Johnson. [...] XXX (October 1936), pp. 93-103.
A REDUCED FORMULA FOR POINT-BISERIAL
COEFFICIENT OF CORRELATION


WILLIAM O. BUSCHMAN 
Portland State Extension Center 
Portland, Oregon 


TWO OF THE most common types of correlation coefficients for
the purpose of item analysis are biserial r and
point-biserial r. The first of these, biserial r, assumes
that the item is represented by a dichotomization of a
continuous and normally distributed variable and that the
criterion is a continuous but not necessarily normal
variable. Whether or not these assumptions can be justified
is often questionable. Richardson and Stalnaker (5:463-64)
indicate that the biserial r is not a suitable measure for
item analysis where the responses are dichotomized because
it assumes (1) that the distribution of the dichotomized
variable is normal, and (2) that the dichotomized variable
is continuous. They indicate that unless these conditions
can be established and shown to fit the case being studied,
use of the biserial r is not warranted. In such cases the
point-biserial r is more suitable.

The laborious computations which are required for most
methods of item analysis have been a deterrent to their use.
In order to reduce the labor of computing the biserial
coefficient, Flanagan devised a procedure using only the
upper and lower portions of the group (3:674-80). Dunlap
(2:51-58), Chapanis (1:297-304) and others have proposed
methods for shortening the amount of computation in
determining the biserial index. Such shortened methods are
preferable if they yield equally satisfactory results, and
in some cases may be preferred because of the saving in time
and effort even though a certain degree of accuracy may be
sacrificed. A reduced formula for the point-biserial
coefficient is developed and shown below.

The point-biserial coefficient of correlation is given by
the following formula (4:353):

$$r_{pb} = \frac{M_p - M_f}{\sigma_t} \sqrt{pq} \tag{1}$$

in which:

M_p = the mean score of the individuals passing the item,

M_f = the mean score of individuals failing the item,

σ_t = the standard deviation of the total distribution of
scores,

p = the proportion of individuals passing the item,

q = the proportion of individuals failing the item.

For purposes of comparing the validity of items on a single
test and not for comparison with items on other tests, the
reduction to standard form by division by σ_t is unnecessary
and may be omitted. This gives what will be termed the
reduced point-biserial coefficient of correlation. The
formula for this may then be written:

$$r_{rpb} = (M_p - M_f) \sqrt{pq} \tag{2}$$

The difference M_t - M_f is as adequate as M_p - M_f
(1:303). Thus the formula can be expressed in the
alternative form:

$$r_{rpb} = (M_t - M_f) \sqrt{pq} \tag{3}$$

in which M_t is the mean of the total distribution of
scores.

If deviations are taken from the same step in each case,
then the means M_t and M_f may be replaced by the deviations
of the true means from the guessed means, as the differences
will be cancelled in the subtraction M_t - M_f. The formula
can then be expressed:

$$r_{rpb} = \left( \frac{\sum f_t d}{N_t} - \frac{\sum f_f d}{N_f} \right) \sqrt{pq} \tag{4}$$

in which N_t represents the total number of scores and N_f
the number failing the item. This is a more usable form than
either (2) or (3) above. Further simplification of the
computations is possible by taking the deviations (d) in
terms of class intervals, since the class interval is
constant throughout any problem. If the zero deviation is
taken at the lowest interval in which any frequencies occur,
negative deviations may be avoided.

The reduced formula will give values which are just as
accurate and adequate as the formulas from which they are
derived. Though results can be applied only to the problem
on which the values are computed, this is the way in which
this coefficient is generally used. If comparisons with
other sets of data are sought, the data obtained by the
reduced formula can be converted to the conventional
formulas, but only in rare cases would this furnish useful
information.

The development of the reduced form of the formula for the
point-biserial coefficient of correlation (4) will make it
easier to use and apply this coefficient to those situations
in which it is the most suitable of the several measures
which are used for item analysis.
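
Formulas (1) and (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the scores are hypothetical, ungrouped scores are used instead of a frequency table (so each score has a frequency of one), and σ_t is taken as the standard deviation with N, not N - 1, in the denominator. The `origin` argument stands for the arbitrary step from which the deviations d are measured.

```python
import math


def point_biserial(scores, passed):
    """Formula (1): ((M_p - M_f) / sigma_t) * sqrt(p * q)."""
    n = len(scores)
    pass_scores = [s for s, ok in zip(scores, passed) if ok]
    fail_scores = [s for s, ok in zip(scores, passed) if not ok]
    p = len(pass_scores) / n
    q = len(fail_scores) / n
    m_p = sum(pass_scores) / len(pass_scores)
    m_f = sum(fail_scores) / len(fail_scores)
    m_t = sum(scores) / n
    sigma_t = math.sqrt(sum((s - m_t) ** 2 for s in scores) / n)
    return (m_p - m_f) / sigma_t * math.sqrt(p * q)


def reduced_point_biserial(scores, passed, origin=0.0):
    """Formula (4): mean deviations of all scores and of failing scores
    from a common arbitrary origin, differenced and multiplied by sqrt(p * q)."""
    n_t = len(scores)
    fail_scores = [s for s, ok in zip(scores, passed) if not ok]
    n_f = len(fail_scores)
    q = n_f / n_t
    p = 1.0 - q
    mean_dev_total = sum(s - origin for s in scores) / n_t
    mean_dev_fail = sum(s - origin for s in fail_scores) / n_f
    return (mean_dev_total - mean_dev_fail) * math.sqrt(p * q)


# Made-up total scores and pass (1) / fail (0) marks on one item.
scores = [10, 12, 15, 18, 20, 22, 25, 30]
passed = [0, 0, 1, 0, 1, 1, 1, 1]
print(point_biserial(scores, passed))
# The choice of origin does not affect the reduced coefficient.
print(reduced_point_biserial(scores, passed, origin=0.0))
print(reduced_point_biserial(scores, passed, origin=10.0))
```

The reduced value is on the scale of the raw scores, so it is useful only for ranking items within the same test, as the article points out.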